
After updating Gymnasium to 1.0 and/or stable_baselines3 to the latest version, at least dqn_jax.py doesn't work anymore #499

Open
imkow opened this issue Feb 17, 2025 · 7 comments

Comments

@imkow

imkow commented Feb 17, 2025

As the title says. Thanks.

@jugheadjones10

jugheadjones10 commented Feb 26, 2025

Quite a few things changed with Gymnasium 1.0, I think. In particular, they changed how the final observation is stored when an episode terminates:
https://gymnasium.farama.org/gymnasium_release_notes/
I had to make my own modifications to get dqn_jax to work with the updated Gymnasium.

@sdpkjc
Collaborator

sdpkjc commented Feb 26, 2025

This requires updating all the code files. This issue is quite important, and if no one else steps up to handle this update, I might start working on it next week. 🚀

@pseudo-rnd-thoughts
Collaborator

@sdpkjc Before you work on it: we have added backward compatibility for vector environments in Gymnasium v1.1, planned for release in the next few days (https://farama.org/Vector-Autoreset-Mode).
This would allow CleanRL to update to Gymnasium v1.1 with minimal changes.
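
For reference, here is a minimal sketch of opting back into the old same-step behaviour with the v1.1 API described in the blog post. The AutoresetMode enum, the autoreset_mode argument, and the metadata key are taken from that post and should be treated as assumptions until v1.1 is actually released:

# Sketch based on https://farama.org/Vector-Autoreset-Mode; API names assumed.
import gymnasium as gym
from gymnasium.vector import AutoresetMode, SyncVectorEnv

# Keep the pre-1.0 "same-step" autoreset so existing code that reads
# infos["final_observation"] / infos["final_info"] should keep working.
envs = SyncVectorEnv(
    [lambda: gym.make("CartPole-v1") for _ in range(4)],
    autoreset_mode=AutoresetMode.SAME_STEP,
)

# Per the blog post, the active mode is advertised through the env metadata.
print(envs.metadata.get("autoreset_mode"))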

@jugheadjones10

jugheadjones10 commented Feb 27, 2025

I felt that the Gymnasium v1.0.0 release notes could use a bit more detail in their explanation of the updated autoreset behaviour for vector environments. Below is a short writeup for anyone else who might become as confused as I did.

Gymnasium v1.0.0 changed the auto-reset behaviour of vector environments. Being careless about this small change cost me many precious hours, so here is a simple explainer of the update to save you yours.

Previous behaviour

Quoting from the docs:

Previously in Gym and Gymnasium, auto-resetting was done on the same step as the environment episode ends, such that the final observation and info would be stored in the step's info, i.e., info["final_observation"] and info["final_info"] and standard obs and info containing the sub-environment's reset observation and info.

This means that on the step whose action leads to the terminal state, the next_obs returned by the environment will already be the first observation of the freshly reset environment. To access the actual final observation resulting from that action, you need to use info["final_observation"].

As pointed out in the docs, this leads to code like this (taken from CleanRL dqn.py):

real_next_obs = next_obs.copy()
# Replace the auto-reset observations with the true final observations before storing.
for idx, d in enumerate(dones):
    if d:
        real_next_obs[idx] = infos["final_observation"][idx]
rb.add(obs, real_next_obs, actions, rewards, dones, infos)

v1.0.0 behaviour

Quoting from the docs:

However, over time, the development team has recognized the inefficiency of this approach (primarily due to the extensive use of a Python dictionary) and the annoyance of having to extract the final observation to train agents correctly, for example. Therefore, in v1.0.0, we are modifying autoreset to align with specialized vector-only projects like EnvPool and SampleFactory where the sub-environment doesn't reset until the next step.

What does it mean that the sub-environment doesn't reset until the next step? First, it means that on the step whose action leads to the terminal state, the next_obs returned by the environment will be the actual final observation of the episode, which fixes the annoyance of digging through the infos dictionary in previous versions.

More importantly, the sub-environment is only auto-reset on the next step: no matter what action you pass to env.step(), the observation you get back will be the initial observation of a newly reset environment. This means you essentially need to "throw away" the first transition after an episode finishes, because the (state, action, next state) tuple is inconsistent: the "state" is the final observation of the previous episode, the action is whatever your model produced (which the reset ignores), and the next state is the initial observation of the new episode. You can see how that would mess up a TD update.

The solution proposed in the docs is to keep an autoreset array that tracks which sub-environments are pending an autoreset; transitions from those environments are not added to the replay buffer:

import gymnasium as gym
import numpy as np

# Any vector env works here; Gymnasium v1.0's built-in vector envs use next-step autoreset.
envs = gym.vector.SyncVectorEnv([lambda: gym.make("CartPole-v1") for _ in range(3)])
total_timesteps = 10_000

replay_buffer = []
obs, _ = envs.reset()
autoreset = np.zeros(envs.num_envs)
for _ in range(total_timesteps):
    next_obs, rewards, terminations, truncations, _ = envs.step(envs.action_space.sample())

    for j in range(envs.num_envs):
        # autoreset[j] is True when sub-env j finished on the previous step, so this
        # step's transition spans the reset boundary and is skipped.
        if not autoreset[j]:
            replay_buffer.append((
                obs[j], rewards[j], terminations[j], truncations[j], next_obs[j]
            ))

    obs = next_obs
    autoreset = np.logical_or(terminations, truncations)
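
To connect this back to dqn_jax.py: below is a rough sketch (my own adaptation, not an official CleanRL fix) of how the same skip-pending-autoreset pattern could be applied to a CleanRL-style loop that stores transitions in SB3's ReplayBuffer, as dqn_jax.py does. The buffer is built with n_envs=1 so that transitions from individual sub-environments can be added (or skipped) one at a time, and random actions stand in for the Q-network.

# Sketch only: one possible adaptation of the docs pattern to SB3's ReplayBuffer.
import gymnasium as gym
import numpy as np
from stable_baselines3.common.buffers import ReplayBuffer

envs = gym.vector.SyncVectorEnv([lambda: gym.make("CartPole-v1") for _ in range(3)])
rb = ReplayBuffer(
    10_000,
    envs.single_observation_space,
    envs.single_action_space,
    device="cpu",
    n_envs=1,  # add per-sub-env transitions so pending-autoreset ones can be skipped
    handle_timeout_termination=False,
)

obs, _ = envs.reset(seed=0)
autoreset = np.zeros(envs.num_envs, dtype=bool)
for _ in range(1_000):
    actions = envs.action_space.sample()  # random actions stand in for the Q-network
    next_obs, rewards, terminations, truncations, infos = envs.step(actions)

    for j in range(envs.num_envs):
        if not autoreset[j]:
            # With next-step autoreset, next_obs[j] already is the true final
            # observation when the episode ends, so no infos lookup is needed.
            rb.add(
                obs[j : j + 1],
                next_obs[j : j + 1],
                actions[j : j + 1],
                rewards[j : j + 1],
                terminations[j : j + 1],
                [{}],
            )

    obs = next_obs
    autoreset = np.logical_or(terminations, truncations)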

@sdpkjc
Collaborator

sdpkjc commented Feb 27, 2025

Thank you, @jugheadjones10, for the excellent summary of the changes in Gymnasium v1.0.0! 😄

At the moment, I’m inclined to start with Gymnasium v1.1, considering the compatibility improvements mentioned by @pseudo-rnd-thoughts. We can utilize the Same-Step mode while keeping CleanRL’s autoreset behavior unchanged for now, ensuring that all dependencies are updated first.

As the next step, we can then transition to Next-Step mode, which may require rerunning a large number of experiments to ensure consistent results.

Looking forward to hearing everyone’s thoughts! 🤔

@pseudo-rnd-thoughts
Collaborator

@sdpkjc Sounds like a plan, do you want any help with the update?

For the next-step mode, with regards to #448, I would only update the fixed-rollout-based implementations (i.e., PPO) and leave the rest using same-step, as I don't believe there should be a performance difference.
It might be worth making a version of DQN or PPO with the different autoreset mode to help users, but that would be a separate change.

@RishiMalhotra920

Yeah ppo_continuous and ppo don't work with the new gym environments. I had to downgrade gym to 0.28.1.
