Turning Sand to Gold: Recycling Data to Bridge On-Policy and Off-Policy Learning via Causal Bound

Poster at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)

Authors

Paper Link

Code Overview

Our code is built on the Stable Baselines3 library [1].

This repository contains the code modifications to the Stable Baselines3 library necessary to implement our proposed SUFT method, as well as the run scripts to reproduce the experiments presented in the paper.

We used Python 3.11.10.

Modified Files from Stable Baselines3:

Our implementation modifies six Python files from the Stable Baselines3 library:

  1. buffers.py
  2. dqn.py
  3. off_policy_algorithm.py
  4. ppo.py
  5. sac.py
  6. type_aliases.py

Modifications are indicated with # SUFT CHANGE in our code.
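
For orientation, the sketch below shows only how the marker comment appears in the modified files; the function itself is hypothetical and does not reflect the actual SUFT logic.

```python
import numpy as np

# Hypothetical example; illustrates the marker convention only, not repository code.
def compute_loss(q_values: np.ndarray, targets: np.ndarray) -> float:
    # Unchanged Stable Baselines3-style computation (illustrative).
    loss = float(np.mean((q_values - targets) ** 2))

    # SUFT CHANGE: lines added or modified for the SUFT method are preceded
    # by this marker comment in each of the six files listed above.
    return loss
```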

Additional RL Agents:

The repository also includes two Python files with RL agents that are not part of Stable Baselines3:

  1. regular_dqn.py
  2. ddqn.py

Run instructions

Setup

  1. Clone and set up the Stable Baselines3 library
  2. Apply the modified versions of the six Python files from this repository to your Stable Baselines3 installation; the # SUFT CHANGE comments mark the relevant modifications (see the sketch after this list).
  3. (Optional) Add the two additional RL agent files (regular_dqn.py, ddqn.py).
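
A minimal sketch of steps 1 and 2, assuming a from-source installation of Stable Baselines3 (the exact layout of your checkout may differ):

```bash
git clone https://github.com/DLR-RM/stable-baselines3.git
cd stable-baselines3
pip install -e .
# Replace the six files listed above with the modified versions from this repository,
# keeping them at their original locations inside the stable_baselines3 package.
```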

Run Scripts

To train RL agents with the SUFT method, we provide two Python run scripts:

  1. Atari: atari_run.py.
  2. MuJoCo: mujoco_run.py.

Examples:

Atari PPO Example

Run script:

python atari_run.py "cuda:0" 5.0 "PPO" 400000 42 "ALE/Alien-v5"

Atari Vanilla DQN / Double DQN / Regular DQN Example

Run script:

python atari_run.py "cuda:0" 1.0 "DDQN" 400000 42 "ALE/Alien-v5" 4000

Argument details:

| Argument | Example Value | Description |
| --- | --- | --- |
| run_file | atari_run.py | Training run script |
| run_device | "cuda:0" | Device on which to run the code (a specific CUDA GPU) |
| tf_lambda | 1.0 | SUFT OPE term λ hyper-parameter |
| agent_name | "DDQN" | RL agent ("PPO", "DQN", "DDQN", "DQN1") |
| timesteps | 400000 | Total training steps (400K) |
| random_seed | 42 | Random seed |
| gym_env_name | "ALE/Alien-v5" | Gymnasium environment (any Atari game) |
| buffer_size | 4000 | Training buffer size |

MuJoCo PPO Example

Run script:

python mujoco_run.py "cpu" 1.8 "PPO" 1000000 "Walker2d-v4"

MuJoCo SAC Example

Run script:

python mujoco_run.py "cpu" 0.6 "SAC" 1000000 "Walker2d-v4" 4000

Argument details:

| Argument | Example Value | Description |
| --- | --- | --- |
| run_file | mujoco_run.py | Training run script |
| run_device | "cpu" | Device on which to run the code (CPU or a specific CUDA GPU) |
| tf_lambda | 0.6 | SUFT OPE term λ hyper-parameter |
| agent_name | "SAC" | RL agent ("PPO", "SAC") |
| timesteps | 1000000 | Total training steps (1M) |
| gym_env_name | "Walker2d-v4" | Gymnasium environment (any MuJoCo environment) |
| buffer_size | 4000 | Training buffer size |

References

[1] Raffin, A., Hill, A., Gleave, A., Kanervisto, A., Ernestus, M., and Dormann, N. Stable-baselines3: Reliable reinforcement learning implementations. Journal of Machine Learning Research, 22(268):1–8, 2021.
