
After updating Gymnasium to 1.0 and/or stable_baselines3 to the latest version, at least dqn_jax.py doesn't work anymore #499

Open
imkow opened this issue Feb 17, 2025 · 7 comments

Comments

@imkow

imkow commented Feb 17, 2025

As the title says. Thanks.

@jugheadjones10

jugheadjones10 commented Feb 26, 2025

Quite a few things changed with Gymnasium 1.0, I think. In particular, they changed how the final observation is stored when an episode terminates:
https://gymnasium.farama.org/gymnasium_release_notes/
I had to make my own modifications to get dqn_jax to work with the updated Gymnasium.

@sdpkjc
Collaborator

sdpkjc commented Feb 26, 2025

This requires updating all the code files. This issue is quite important, and if no one else steps up to handle this update, I might start working on it next week. 🚀

@pseudo-rnd-thoughts
Collaborator

@sdpkjc Before you work on it: we have added backward compatibility for vector environments in Gymnasium v1.1, planned for release in the next few days (https://farama.org/Vector-Autoreset-Mode).
This would allow CleanRL to update to Gymnasium v1.1 with minimal changes.
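
For reference, here is a minimal sketch of opting back into the old same-step behaviour with the v1.1 API described in the blog post. The AutoresetMode enum, the autoreset_mode argument, and the metadata key are taken from that post and should be treated as assumptions until v1.1 is actually released:

# Sketch based on https://farama.org/Vector-Autoreset-Mode; API names assumed.
import gymnasium as gym
from gymnasium.vector import AutoresetMode, SyncVectorEnv

# Keep the pre-1.0 "same-step" autoreset so existing code that reads
# infos["final_observation"] / infos["final_info"] should keep working.
envs = SyncVectorEnv(
    [lambda: gym.make("CartPole-v1") for _ in range(4)],
    autoreset_mode=AutoresetMode.SAME_STEP,
)

# Per the blog post, the active mode is advertised through the env metadata.
print(envs.metadata.get("autoreset_mode"))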

@jugheadjones10

jugheadjones10 commented Feb 27, 2025

I felt that the Gymnasium v1.0.0 release notes could use a bit more detail in their explanation of the updated autoreset behaviour for vector environments. Below is a short writeup for anyone else who might become as confused as I did.

Gymnasium v1.0.0 changed the auto-reset behaviour of vector environments. Being careless about this small change cost me many precious hours, so here is a simple explainer of the update to save you yours.

Previous behaviour

Quoting from the docs:

Previously in Gym and Gymnasium, auto-resetting was done on the same step as the environment episode ends, such that the final observation and info would be stored in the step's info, i.e., info["final_observation"] and info["final_info"] and standard obs and info containing the sub-environment's reset observation and info.

This means that on the step whose action leads to the terminal state, the next_obs returned by the environment will already be the first observation of the freshly reset environment. To access the actual final observation resulting from that action, you need to use info["final_observation"].

As pointed out in the docs, this leads to code like this (taken from CleanRL dqn.py):

real_next_obs = next_obs.copy()
# Replace the auto-reset observations with the true final observations before storing.
for idx, d in enumerate(dones):
    if d:
        real_next_obs[idx] = infos["final_observation"][idx]
rb.add(obs, real_next_obs, actions, rewards, dones, infos)

v1.0.0 behaviour

Quoting from the docs:

However, over time, the development team has recognized the inefficiency of this approach (primarily due to the extensive use of a Python dictionary) and the annoyance of having to extract the final observation to train agents correctly, for example. Therefore, in v1.0.0, we are modifying autoreset to align with specialized vector-only projects like EnvPool and SampleFactory where the sub-environment doesn't reset until the next step.

What does it mean that the sub-environment doesn't reset until the next step? First, it means that on the step whose action leads to the terminal state, the next_obs returned by the environment will be the actual final observation of the episode, which fixes the annoyance of digging through the infos dictionary in previous versions.

More importantly, the sub-environment is only auto-reset on the next step: no matter what action you pass to env.step(), the observation you get back will be the initial observation of a newly reset environment. This means you essentially need to "throw away" the first transition after an episode finishes, because the (state, action, next state) tuple is inconsistent: the "state" is the final observation of the previous episode, the action is whatever your model produced (which the reset ignores), and the next state is the initial observation of the new episode. You can see how that would mess up a TD update.

The solution proposed in the docs is to keep an autoreset array that tracks which sub-environments are pending an autoreset; transitions from those environments are not added to the replay buffer:

import gymnasium as gym
import numpy as np

# Any vector env works here; Gymnasium v1.0's built-in vector envs use next-step autoreset.
envs = gym.vector.SyncVectorEnv([lambda: gym.make("CartPole-v1") for _ in range(3)])
total_timesteps = 10_000

replay_buffer = []
obs, _ = envs.reset()
autoreset = np.zeros(envs.num_envs)
for _ in range(total_timesteps):
    next_obs, rewards, terminations, truncations, _ = envs.step(envs.action_space.sample())

    for j in range(envs.num_envs):
        # autoreset[j] is True when sub-env j finished on the previous step, so this
        # step's transition spans the reset boundary and is skipped.
        if not autoreset[j]:
            replay_buffer.append((
                obs[j], rewards[j], terminations[j], truncations[j], next_obs[j]
            ))

    obs = next_obs
    autoreset = np.logical_or(terminations, truncations)
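
To connect this back to dqn_jax.py: below is a rough sketch (my own adaptation, not an official CleanRL fix) of how the same skip-pending-autoreset pattern could be applied to a CleanRL-style loop that stores transitions in SB3's ReplayBuffer, as dqn_jax.py does. The buffer is built with n_envs=1 so that transitions from individual sub-environments can be added (or skipped) one at a time, and random actions stand in for the Q-network.

# Sketch only: one possible adaptation of the docs pattern to SB3's ReplayBuffer.
import gymnasium as gym
import numpy as np
from stable_baselines3.common.buffers import ReplayBuffer

envs = gym.vector.SyncVectorEnv([lambda: gym.make("CartPole-v1") for _ in range(3)])
rb = ReplayBuffer(
    10_000,
    envs.single_observation_space,
    envs.single_action_space,
    device="cpu",
    n_envs=1,  # add per-sub-env transitions so pending-autoreset ones can be skipped
    handle_timeout_termination=False,
)

obs, _ = envs.reset(seed=0)
autoreset = np.zeros(envs.num_envs, dtype=bool)
for _ in range(1_000):
    actions = envs.action_space.sample()  # random actions stand in for the Q-network
    next_obs, rewards, terminations, truncations, infos = envs.step(actions)

    for j in range(envs.num_envs):
        if not autoreset[j]:
            # With next-step autoreset, next_obs[j] already is the true final
            # observation when the episode ends, so no infos lookup is needed.
            rb.add(
                obs[j : j + 1],
                next_obs[j : j + 1],
                actions[j : j + 1],
                rewards[j : j + 1],
                terminations[j : j + 1],
                [{}],
            )

    obs = next_obs
    autoreset = np.logical_or(terminations, truncations)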

@sdpkjc
Collaborator

sdpkjc commented Feb 27, 2025

Thank you, @jugheadjones10, for the excellent summary of the changes in Gymnasium v1.0.0! 😄

At the moment, I’m inclined to start with Gymnasium v1.1, considering the compatibility improvements mentioned by @pseudo-rnd-thoughts. We can utilize the Same-Step mode while keeping CleanRL’s autoreset behavior unchanged for now, ensuring that all dependencies are updated first.

As the next step, we can then transition to Next-Step mode, which may require rerunning a large number of experiments to ensure consistent results.

Looking forward to hearing everyone’s thoughts! 🤔

@pseudo-rnd-thoughts
Collaborator

@sdpkjc Sounds like a plan, do you want any help with the update?

For the next-step mode, with regards to #448, I would only update the fixed-rollout-based implementations (i.e., PPO) and leave the rest using same-step, as I don't believe there should be a performance difference.
It might be worth making a version of DQN or PPO with the different autoreset mode to help users, but that would be a separate change.

@RishiMalhotra920

Yeah ppo_continuous and ppo don't work with the new gym environments. I had to downgrade gym to 0.28.1.
