R2D2 not converging? #2
Comments
Hi @NikEyX, thank you for trying it! I am very interested in whether it will converge after correcting the differences you pointed out.
Hey, thanks for replying so quickly! Yeah, it's not straightforward. The papers reference one another, so you kind of have to go through them sequentially. Sometimes they omit relevant information, or make claims that are a bit disingenuous, such as "in pursuit of generality we decided to disable treating life losses as episode end", which originally led me to believe this was not an important factor, whereas in reality it is a very significant factor for achieving higher performance.

I'd highly recommend adding the constant epsilon for each worker. If you only have 4 workers, I suggest you simply use a linear scaling (as opposed to the exponential scaling they use), e.g. one worker with 0.3, one with 0.2, one with 0.1, and one with 0.001 (if you use 0 it might get stuck too often and prolong training). This should have a large impact on both Ape-X and R2D2.

For Ape-X I use 4 workers, a target sync of 500, a buffer size of 100,000, and a priority alpha of 0.6. I don't use importance sampling; I use a different technique similar to eligibility traces, but this doesn't affect Pong much. N-step = 3, gamma = 0.99, batch size = 32. As in the paper, I also pull the samples on 2 background threads and keep them in a queue, so that the GPU can consistently perform calculations without downtime. Doing all of the above, I can solve Pong in under 20 minutes (my best was 12 minutes). The same parameters applied to BeamRider, Breakout, QBert and SpaceInvaders all achieve the same or higher performance than in the paper and common RL libraries.

For more complex problems such as Seaquest, the loss-of-life handling mentioned above is incredibly important. Gradient clipping also becomes important so that the neural net doesn't get stuck in suboptimal configurations early on, so you might want to enable that as well (DeepMind uses gc=40), but for the simple Atari games it isn't required.

Now with R2D2 I started out following the paper exactly. However, it didn't seem to converge well on Pong, which I consider the basic test that needs to be reliably solvable. I only trained my model for 5h at a time and started looking into your implementation to see what I might be missing :) I'll get back to my model and see how I can improve it. As a first step I want to be able to solve Pong, so maybe I'll revisit it with a smaller sequence length. I'll let you know if I get better results out of it (though it'll likely take a week or longer to train the different cases).
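To make the epsilon suggestion concrete, here is a minimal Python sketch (not code from this repo) that contrasts the Ape-X paper's exponential per-worker schedule with the linear values above; the hyperparameter dict just collects the settings I listed.

```python
# Minimal sketch of the per-worker exploration setup described above;
# nothing here is taken from this repository's code.

# Hyperparameters mentioned in the comment above.
APEX_HPARAMS = {
    "num_workers": 4,
    "target_sync": 500,        # learner updates between target-network syncs
    "replay_capacity": 100_000,
    "priority_alpha": 0.6,
    "n_step": 3,
    "gamma": 0.99,
    "batch_size": 32,
}

def paper_epsilons(num_workers: int, eps: float = 0.4, alpha: float = 7.0):
    """Exponential schedule from the Ape-X paper: eps_i = eps ** (1 + alpha * i / (N - 1))."""
    if num_workers == 1:
        return [eps]
    return [eps ** (1 + alpha * i / (num_workers - 1)) for i in range(num_workers)]

# Linear alternative for small worker counts: keep the greediest worker
# slightly above 0 so it does not get stuck early in training.
LINEAR_EPSILONS = [0.3, 0.2, 0.1, 0.001]

if __name__ == "__main__":
    # For 4 workers the paper schedule gives roughly [0.4, 0.047, 0.0056, 0.00066].
    print([round(e, 5) for e in paper_epsilons(4)])
    print(LINEAR_EPSILONS)
```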
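And a second sketch, for the background sample prefetching: a couple of producer threads keep a queue of ready batches so the learner never waits on replay sampling. `replay.sample(batch_size)` and the learner calls in the trailing comment are hypothetical placeholders, not this repo's API.

```python
# Purely illustrative; `replay` is assumed to expose a thread-safe
# sample(batch_size) method.
import queue
import threading

def start_prefetch(replay, batch_size=32, num_threads=2, queue_size=8):
    batches = queue.Queue(maxsize=queue_size)

    def producer():
        while True:
            # CPU-side work (sampling, stacking, decompression) happens here,
            # off the learner thread.
            batches.put(replay.sample(batch_size))

    for _ in range(num_threads):
        threading.Thread(target=producer, daemon=True).start()
    return batches

# The learner loop then becomes roughly:
#   batch = batches.get()          # blocks only if the producers fall behind
#   loss, td_errors = learner.update(batch)
#   replay.update_priorities(batch.indices, td_errors)
```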
It's amazing that you can solve Pong in 12 minutes! I'm looking forward to the R2D2 results.
I am working on a solution to this problem. One was that
Hi,
I am running your model on Pong, and the R2D2 model doesn't seem to be converging at all. In contrast, your Ape-X implementation works and starts converging nicely after 2-3 hours.
Here are the results of your R2D2 implementation after training for 32 hours on a 1080 Ti with 4 workers:
Note that there are various items in your implementation that differ from the papers for both Ape-X and R2D2, such as the worker epsilons being below 0.4 and always constant (which has a significant impact on convergence speed), or the DeepMind R2D2 model taking the last action and last reward as additional inputs.
Did you manage to get any convergence yourself? If so, how can I replicate it?
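For reference, the last-action/last-reward input mentioned above amounts to concatenating a one-hot of the previous action and the previous reward with the conv features before the LSTM. Below is a minimal PyTorch sketch under that reading of the R2D2 paper; the class name and shapes are my own illustration, not this repo's code.

```python
# Sketch only: feed last action and last reward into the recurrent core.
import torch
import torch.nn as nn
import torch.nn.functional as F

class R2D2Core(nn.Module):
    def __init__(self, conv_out_dim, num_actions, hidden=512):
        super().__init__()
        self.num_actions = num_actions
        # Input = conv features + one-hot previous action + scalar previous reward.
        self.lstm = nn.LSTM(conv_out_dim + num_actions + 1, hidden, batch_first=True)

    def forward(self, conv_features, prev_action, prev_reward, hidden_state=None):
        # conv_features: [B, T, conv_out_dim]
        # prev_action:   [B, T] (long), prev_reward: [B, T] (float)
        action_onehot = F.one_hot(prev_action, self.num_actions).float()
        x = torch.cat([conv_features, action_onehot, prev_reward.unsqueeze(-1)], dim=-1)
        return self.lstm(x, hidden_state)
```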