After updating Gymnasium to 1.0 and/or stable_baselines3 to the latest version, at least dqn_jax.py doesn't work anymore #499
Quite a few things changed with Gymnasium 1.0, I think. In particular, they changed how final observations are stored when an episode terminates:
This requires updating all the code files. This issue is quite important, and if no one else steps up to handle this update, I might start working on it next week. 🚀
@sdpkjc Before you work on it, we have added backward compatibility for vector environments in Gymnasium v1.1, planned to be released in the next few days (https://farama.org/Vector-Autoreset-Mode)
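If the v1.1 route is taken, selecting the autoreset mode might look like the sketch below. This is only a hedged illustration based on the linked Farama post: the `autoreset_mode` constructor argument and the `AutoresetMode` enum are assumed here, so the exact names should be checked against the v1.1 release notes.

```python
# Hedged sketch: opting into the pre-1.0 ("same-step") autoreset behaviour,
# assuming Gymnasium v1.1's AutoresetMode enum and autoreset_mode argument
# described in the linked Farama post.
import gymnasium as gym
from gymnasium.vector import AutoresetMode, SyncVectorEnv

envs = SyncVectorEnv(
    [lambda: gym.make("CartPole-v1") for _ in range(4)],
    autoreset_mode=AutoresetMode.SAME_STEP,  # NEXT_STEP is the v1.0 default
)

obs, infos = envs.reset(seed=0)
next_obs, rewards, terminations, truncations, infos = envs.step(envs.action_space.sample())
```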
I felt that the Gymnasium v1.0.0 update notes could use a bit more detail in their explanation of the updated autoreset behaviour for vector environments, so below is a short writeup for anyone else who might get as confused as I did. Gymnasium v1.0.0 changed the auto-reset behaviour of VectorEnvs, and overlooking this small change cost me many precious hours, so here is a simple explainer to save you that time.

Previous behaviour

Previously (before v1.0.0), each sub-environment was automatically reset within the same step in which its episode ended.
This means that on the action that leads to the terminal state, the next_obs returned by the environment is the already-reset version, i.e. the first observation of the freshly reset sub-environment. To access the actual final observation produced by that last action, you need to read it out of infos. As pointed out in the docs, this leads to code like the following (taken from CleanRL dqn.py):

```python
real_next_obs = next_obs.copy()
for idx, d in enumerate(dones):
    if d:
        real_next_obs[idx] = infos[idx]["terminal_observation"]
rb.add(obs, real_next_obs, actions, rewards, dones, infos)
```

v1.0.0 behaviour

From v1.0.0 onward, a sub-environment that terminates or truncates is not reset until the next step.
What does it mean that the sub-environment doesn't reset until the next step? First, it means that on the action that leads to the terminal state, the next_obs returned by the environment is the actual final observation of the episode, which removes the bookkeeping annoyances described above. More importantly, the environment is only auto-reset on the next step: whatever action is passed for that sub-environment on the following step is ignored, and that step returns the reset observation instead. The solution proposed in the docs is to keep track of an autoreset mask so that these reset transitions are not stored in the replay buffer:

```python
replay_buffer = []
obs, _ = envs.reset()
autoreset = np.zeros(envs.num_envs)
for _ in range(total_timesteps):
    next_obs, rewards, terminations, truncations, _ = envs.step(envs.action_space.sample())
    for j in range(envs.num_envs):
        if not autoreset[j]:
            replay_buffer.append((
                obs[j], rewards[j], terminations[j], truncations[j], next_obs[j]
            ))
    obs = next_obs
    autoreset = np.logical_or(terminations, truncations)
```
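To see the new behaviour in isolation, here is a small self-contained sketch of my own (not from the docs or CleanRL) that runs a vectorised CartPole-v1 and prints when episodes end and when the delayed resets actually happen:

```python
# A self-contained sketch (Gymnasium >= 1.0.0) to observe next-step autoreset:
# when a sub-environment terminates or truncates, it is NOT reset immediately;
# the reset happens on the following envs.step() call, whose action for that
# sub-environment is ignored.
import gymnasium as gym
import numpy as np

envs = gym.vector.SyncVectorEnv([lambda: gym.make("CartPole-v1") for _ in range(2)])
envs.reset(seed=0)
autoreset = np.zeros(envs.num_envs, dtype=bool)

for step in range(200):
    next_obs, rewards, terminations, truncations, infos = envs.step(envs.action_space.sample())
    if autoreset.any():
        # These sub-envs were reset on this step; next_obs for them is a fresh
        # initial observation, so this transition should not be stored.
        print(step, "reset sub-envs:", np.where(autoreset)[0])
    autoreset = np.logical_or(terminations, truncations)
    if autoreset.any():
        # next_obs for these sub-envs is the true final observation of the episode.
        print(step, "episode ended in sub-envs:", np.where(autoreset)[0])
```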
Thank you, @jugheadjones10, for the excellent summary of the changes in Gymnasium v1.0.0! 😄 At the moment, I’m inclined to start with Gymnasium v1.1, considering the compatibility improvements mentioned by @pseudo-rnd-thoughts. We can utilize the Same-Step mode while keeping CleanRL’s autoreset behavior unchanged for now, ensuring that all dependencies are updated first. As the next step, we can then transition to Next-Step mode, which may require rerunning a large number of experiments to ensure consistent results. Looking forward to hearing everyone’s thoughts! 🤔
@sdpkjc Sounds like a plan; do you want any help with the update? For next-step mode, with regard to #448, I would only update the fixed-rollout-based implementations (i.e., PPO) and leave the rest using same-step, as I don't believe there should be a performance difference.
Yeah, ppo_continuous and ppo don't work with the new gym environments. I had to downgrade gym to 0.28.1.
ATS. thx.