Train Strategy PPO agent #923
Unanswered
101AlexMartin
asked this question in Q&A
Replies: 0 comments
Hello. I'm relatively new to RL and I'm trying to train a PPO agent, but its performance over episodes is very unstable: sometimes the average return (over a window of 50 episodes) is positive, sometimes negative. I wonder if the way I'm training the PPO agent is correct. This is what I currently do:
This does not make much sense to me based on my experience with ANNs, where in every epoch the loss is computed over the whole training set. To train the PPO agent, what would make sense to me is to use the whole experience dataset and only update the agent after a certain number of new episodes have been added to it. But I wonder whether in RL this is not the case, and agents are instead retrained every time a new episode is added to the dataset, using a batch size smaller than the total number of episodes in the dataset.
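For context on the "collect a batch of new episodes, then train" scheme described above, this is a minimal sketch of how a standard on-policy PPO loop is typically configured. It assumes the Stable-Baselines3 library and the CartPole-v1 environment, which are illustrative choices and not taken from the post itself:

```python
# Minimal sketch of a standard on-policy PPO setup (assumes Stable-Baselines3
# and CartPole-v1; both are illustrative choices, not from the original post).
import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("CartPole-v1")

model = PPO(
    "MlpPolicy",
    env,
    n_steps=2048,    # transitions collected per rollout before each update
    batch_size=64,   # minibatch size used within one update
    n_epochs=10,     # passes over the freshly collected rollout per update
    verbose=1,
)

# learn() alternates between collecting n_steps of fresh experience and
# running n_epochs of minibatch gradient steps on that rollout buffer;
# the buffer is then discarded rather than kept and grown, because PPO
# only trains on data gathered by the current policy (on-policy).
model.learn(total_timesteps=100_000)
```

In this kind of setup the agent is not updated after every single episode on an ever-growing dataset; it is updated after each fixed-size batch of fresh experience, which may be part of the behavior being asked about here.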