Our code builds on the Stable Baselines3 library [1]. This repository contains the code modifications to Stable Baselines3 needed to implement our proposed SUFT method, along with the run scripts to reproduce the experiments presented in the paper.
We used Python 3.11.10.
Our implementation modifies six Python files from the Stable Baselines3 library:
- `buffers.py`
- `dqn.py`
- `off_policy_algorithm.py`
- `ppo.py`
- `sac.py`
- `type_aliases.py`
Each modification is marked with a `# SUFT CHANGE` comment in the code.
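To list every modified location at once, you can search for the marker (this assumes you are inside the Stable Baselines3 source tree):

```
grep -rn "# SUFT CHANGE" stable_baselines3/
```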
In addition, there are two Python files with additional RL agents that are not included in Stable Baselines3:
- `regular_dqn.py`
- `ddqn.py`
To set up the code:
- Clone and set up the Stable Baselines3 library.
- Apply the modified versions of the six Python files from this repository; the `# SUFT CHANGE` comments mark the relevant code modifications (a setup sketch follows this list).
- (Optional) Add the additional RL agents (`regular_dqn.py` and `ddqn.py`).
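A minimal setup sketch, assuming the upstream Stable Baselines3 repository layout (the destination paths below are where these files live in the upstream source tree; `/path/to/this/repo` is a placeholder for your checkout of this repository):

```
git clone https://github.com/DLR-RM/stable-baselines3.git
cd stable-baselines3
pip install -e .

# Overwrite the six files with the SUFT-modified versions from this repository.
cp /path/to/this/repo/buffers.py              stable_baselines3/common/buffers.py
cp /path/to/this/repo/off_policy_algorithm.py stable_baselines3/common/off_policy_algorithm.py
cp /path/to/this/repo/type_aliases.py         stable_baselines3/common/type_aliases.py
cp /path/to/this/repo/dqn.py                  stable_baselines3/dqn/dqn.py
cp /path/to/this/repo/ppo.py                  stable_baselines3/ppo/ppo.py
cp /path/to/this/repo/sac.py                  stable_baselines3/sac/sac.py
```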
To train the RL agent using the SUFT method, we created two Python run scripts:
- Atari: `atari_run.py`
- MuJoCo: `mujoco_run.py`
Example Atari run (PPO):
```
python atari_run.py "cuda:0" 5.0 "PPO" 400000 42 "ALE/Alien-v5"
```

Example Atari run (DDQN, with an explicit buffer size):
```
python atari_run.py "cuda:0" 1.0 "DDQN" 400000 42 "ALE/Alien-v5" 4000
```
Argument details:

| Argument | Example Value | Description |
|---|---|---|
| run_file | `atari_run.py` | Training run script |
| run_device | `"cuda:0"` | Device on which to run the code (e.g., a specific CUDA GPU) |
| tf_lambda | `1.0` | SUFT OPE term λ hyperparameter |
| agent_name | `"DDQN"` | RL agent (`"PPO"`, `"DQN"`, `"DDQN"`, `"DQN1"`) |
| timesteps | `400000` | Total training steps (400K) |
| random_seed | `42` | Random seed |
| gym_env_name | `"ALE/Alien-v5"` | Gymnasium environment (any Atari game) |
| buffer_size | `4000` | Training buffer size (omitted in the PPO example) |
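For reference, here is a minimal sketch of how these positional arguments could map to names inside the script (hypothetical; the actual `atari_run.py` may parse them differently):

```python
import sys

# Positional arguments, in the order documented in the table above.
run_device   = sys.argv[1]          # e.g. "cuda:0"
tf_lambda    = float(sys.argv[2])   # SUFT OPE term lambda, e.g. 1.0
agent_name   = sys.argv[3]          # "PPO", "DQN", "DDQN", or "DQN1"
timesteps    = int(sys.argv[4])     # e.g. 400000
random_seed  = int(sys.argv[5])     # e.g. 42
gym_env_name = sys.argv[6]          # e.g. "ALE/Alien-v5"
# buffer_size is only passed in the off-policy (DQN-family) examples.
buffer_size  = int(sys.argv[7]) if len(sys.argv) > 7 else None
```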
Example MuJoCo run (PPO):
```
python mujoco_run.py "cpu" 1.8 "PPO" 1000000 "Walker2d-v4"
```

Example MuJoCo run (SAC, with an explicit buffer size):
```
python mujoco_run.py "cpu" 0.6 "SAC" 1000000 "Walker2d-v4" 4000
```
Argument details:

| Argument | Example Value | Description |
|---|---|---|
| run_file | `mujoco_run.py` | Training run script |
| run_device | `"cpu"` | Device on which to run the code (CPU or a specific CUDA GPU) |
| tf_lambda | `0.6` | SUFT OPE term λ hyperparameter |
| agent_name | `"SAC"` | RL agent (`"PPO"`, `"SAC"`) |
| timesteps | `1000000` | Total training steps (1M) |
| gym_env_name | `"Walker2d-v4"` | Gymnasium environment (any MuJoCo environment) |
| buffer_size | `4000` | Training buffer size (omitted in the PPO example) |
[1] Raffin, A., Hill, A., Gleave, A., Kanervisto, A., Ernestus, M., and Dormann, N. Stable-baselines3: Reliable reinforcement learning implementations. Journal of Machine Learning Research, 22(268):1–8, 2021.