```
conda create -n viper python=3.9.0
conda activate viper
cd alfworld/TextWorld
conda install cython numpy
pip install --no-build-isolation -e .[full]
```
```
pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu118
cd ..
pip install -e .[full]
cd lamorel/lamorel
pip install -e .
pip install wandb gym peft bitsandbytes pyvirtualdisplay
```

Download and generate the ALFWorld data:

```
export ALFWORLD_DATA=<storage_path>
alfworld-download
alfworld-generate
```
Change the data paths in the ALFWorld configs to your custom path.
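For reference, in recent ALFWorld versions the data paths live under the `dataset` section of `configs/base_config.yaml` and expand the `ALFWORLD_DATA` variable; the exact keys may differ in your version, so treat this fragment as illustrative:

```yaml
dataset:
  data_path: '$ALFWORLD_DATA/json_2.1.1/train'
  eval_id_data_path: '$ALFWORLD_DATA/json_2.1.1/valid_seen'
  eval_ood_data_path: '$ALFWORLD_DATA/json_2.1.1/valid_unseen'
```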
Find and run the scripts in the `scripts` folder.

The current code is based on the Llama 1B model.
Two types of novelty are rewarded:

- Action novelty (horizontal): to reduce action repetition from the LLM, actions that occur less frequently within a trajectory receive a larger reward.
- Action-pattern novelty (vertical): a novel sequence of actions is rewarded based on the loss of an auxiliary model, the temporal predictor. A T5 model serves as the temporal predictor; it is trained on the PPO buffer to predict the next action given the preceding sequence of actions in the current trajectory. Its loss on new trajectories is used as the action-pattern novelty reward.
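As an illustration, the horizontal (action-novelty) reward can be computed from per-trajectory action counts. This is a minimal sketch: the function name and the `1 / count` weighting are assumptions for exposition, not necessarily the repository's exact formulation.

```python
from collections import Counter

def action_novelty_rewards(actions, scale=1.0):
    """Count-based action novelty: actions that occur less often
    within the trajectory receive a larger reward.
    Hypothetical weighting: scale / count(action)."""
    counts = Counter(actions)
    return [scale / counts[a] for a in actions]

rewards = action_novelty_rewards(["go north", "go north", "open door"])
# "go north" appears twice -> 0.5 each; "open door" appears once -> 1.0
```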
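The vertical (action-pattern) reward can be sketched with a toy stand-in for the temporal predictor. Here a bigram count model replaces the T5 model purely to keep the example self-contained and runnable; in the actual setup the prediction loss comes from the T5 model trained on the PPO buffer.

```python
import math
from collections import defaultdict

class BigramPredictor:
    """Toy stand-in for the T5 temporal predictor: predicts the next
    action from the previous action via bigram counts (a deliberate
    simplification of next-action prediction)."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, trajectory):
        # Record (previous action -> next action) transitions.
        for prev, nxt in zip(trajectory, trajectory[1:]):
            self.counts[prev][nxt] += 1

    def loss(self, trajectory, eps=1e-3):
        """Mean negative log-likelihood of the next actions; a higher
        loss signals a less predictable, i.e. more novel, pattern."""
        nll, steps = 0.0, 0
        for prev, nxt in zip(trajectory, trajectory[1:]):
            total = sum(self.counts[prev].values())
            p = (self.counts[prev][nxt] + eps) / (total + eps)
            nll -= math.log(p)
            steps += 1
        return nll / max(steps, 1)

pred = BigramPredictor()
pred.train(["open fridge", "take apple", "close fridge"])
seen_loss = pred.loss(["open fridge", "take apple"])
novel_loss = pred.loss(["open fridge", "take knife"])
# The unseen action pattern yields a higher loss, hence a larger
# action-pattern novelty reward.
```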
