
R2D2 not converging? #2

Open
NikEyX opened this issue Sep 17, 2019 · 4 comments

Labels: help wanted (Extra attention is needed)

NikEyX commented Sep 17, 2019

Hi,

I am running your model on Pong, and the R2D2 model doesn't seem to be converging at all. In contrast, your Ape-X implementation works and starts converging nicely after 2-3 hours.

Here are your R2D2 implementation's results after training for 32 hours on a 1080 Ti with 4 workers:

[image: R2D2 training results screenshot]

Note that there are various items in your implementation that differ from the papers for both Ape-X and R2D2, such as the papers keeping each worker's epsilon constant and below 0.4 (which has a significant impact on convergence speed), or the DeepMind R2D2 model taking the last action and last reward as additional inputs.
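
For reference, here is a minimal sketch of that last point, i.e. feeding the previous action and reward into the recurrent core; the module and argument names are my own illustration, not taken from this repo:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RecurrentCore(nn.Module):
    """Recurrent core that consumes CNN features plus the previous action/reward."""

    def __init__(self, conv_out_dim, num_actions, hidden_dim=512):
        super().__init__()
        self.num_actions = num_actions
        # LSTM input = CNN features + one-hot previous action + scalar previous reward
        self.lstm = nn.LSTM(conv_out_dim + num_actions + 1, hidden_dim, batch_first=True)

    def forward(self, conv_features, prev_action, prev_reward, hidden):
        # conv_features: (B, T, conv_out_dim), prev_action: (B, T) int64,
        # prev_reward: (B, T) float32, hidden: (h, c) LSTM state
        action_one_hot = F.one_hot(prev_action, self.num_actions).float()
        x = torch.cat([conv_features, action_one_hot, prev_reward.unsqueeze(-1)], dim=-1)
        return self.lstm(x, hidden)
```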

Did you manage to get any convergence yourself? If so, how can I replicate it?

@neka-nat (Owner)

Hi @NikEyX,

Thank you for trying it out!
The R2D2 implementation is experimental.
I couldn't launch 256 actors like in the paper, so I tuned some parameters, but I couldn't get it to converge.

I am very interested in whether it can converge once the differences you pointed out are corrected.
Do you think fixing those implementation differences will make it converge?

NikEyX (Author) commented Sep 17, 2019

Hey, thanks for replying so quickly!

Yeah, it's not straightforward. The papers reference one another, so you kind of have to go through them sequentially. Sometimes they omit relevant information, or make claims that are a bit disingenuous, such as "in pursuit of generality we decided to disable treating life losses as episode ends", which originally led me to believe that this is not an important factor, whereas in reality it is a very significant factor for achieving higher performance.

I'd highly recommend adding a constant epsilon for each worker. If you only have 4 workers, I suggest simply using a linear scaling (as opposed to the exponential scaling they use), e.g. one worker with 0.3, one with 0.2, one with 0.1, and one with 0.001 (if you use 0, it might get stuck too often and prolong training). This should have a large impact on both Ape-X and R2D2.
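
To illustrate, a minimal sketch of per-worker constant epsilons with this linear scaling (the function name and worker indexing are just assumptions about how the actors are identified):

```python
WORKER_EPSILONS = [0.3, 0.2, 0.1, 0.001]

def epsilon_for_worker(worker_id):
    # Each worker keeps its epsilon fixed for the entire run.
    return WORKER_EPSILONS[worker_id % len(WORKER_EPSILONS)]
```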

For Ape-X I use 4 workers, a target sync of 500, a buffer size of 100,000, and a priority alpha of 0.6. I don't use importance sampling; I use a different technique similar to eligibility traces, but this doesn't affect Pong much. N-step=3, gamma=0.99, batchSize=32. As in the paper, I also pull the samples on 2 background threads and keep them in a queue, so that the GPU can consistently perform calculations without downtime. Doing all of the above, I can solve Pong in under 20 minutes (my best was 12 minutes). The same parameters applied to BeamRider, Breakout, QBert, and SpaceInvaders all achieve the same or higher performance than in the paper and common RL libraries.
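
As an illustration of that background sampling, here is a minimal sketch; `replay_buffer.sample()` is an assumed interface, not this repo's API:

```python
import queue
import threading

# A bounded queue keeps a few batches ready so the learner never waits on sampling.
batch_queue = queue.Queue(maxsize=16)

def prefetch_worker(replay_buffer, batch_size=32):
    while True:
        batch = replay_buffer.sample(batch_size)  # CPU-side sampling work
        batch_queue.put(batch)                    # blocks while the queue is full

def start_prefetch_threads(replay_buffer, num_threads=2):
    for _ in range(num_threads):
        threading.Thread(target=prefetch_worker, args=(replay_buffer,), daemon=True).start()

# The training loop then just calls batch_queue.get() for the next batch.
```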

For more complex problems such as Seaquest, the loss-of-life handling mentioned above is incredibly important. Grad clipping also becomes important so that the neural net doesn't get stuck in suboptimal configurations early on, so you might want to enable that as well (DeepMind uses gc=40), but for the simple Atari games it's not required.
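
In PyTorch that's essentially a one-liner before the optimizer step; a minimal sketch (function and variable names are illustrative):

```python
import torch

def optimize_step(model, optimizer, loss, max_norm=40.0):
    # Backprop, clip the global gradient norm (40 as in the DeepMind setting), then update.
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
```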

Now with R2D2 I started out following the paper exactly. However, it didn't seem to converge well on Pong, which I consider the basic test that needs to be solvable. I only trained my model for 5h at a time, and I started looking into your implementation to see what I might be missing :)

I'll get back to my model and see how I can improve it. As a first step I want to be able to solve Pong, so maybe I'll revisit it with a smaller sequence length. I'll let you know if I get any better results out of it (though it'll likely take a week or longer to train the different cases).

@neka-nat (Owner)

It's amazing that you can solve Pong in 12 minutes!
If possible, I would like to take a look at your implementation, or please consider contributing to this repository.

I'm looking forward to the results of R2D2.
Please share your results.

@neka-nat added the "help wanted (Extra attention is needed)" label Sep 23, 2019
@neka-nat (Owner)

I am working on a solution to this problem, and I found and fixed some bugs.

One was that no_grad was nested, so no_grad did not behave as intended. This bug probably caused burn-in not to work. I ran the test again after fixing it, but it still does not converge.
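
For reference, a minimal sketch of how burn-in is usually intended to behave, with only the warm-up segment inside no_grad; the model interface and names here are assumptions, not this repo's code:

```python
import torch

def forward_with_burn_in(model, obs_seq, hidden, burn_in_steps):
    # obs_seq: (B, T, ...); the first `burn_in_steps` frames only warm up the
    # recurrent state and should not receive gradients.
    with torch.no_grad():
        _, hidden = model(obs_seq[:, :burn_in_steps], hidden)
    # Detach the warmed-up state; the remaining steps must run *outside* any
    # enclosing no_grad block so the loss can backpropagate through them.
    hidden = tuple(h.detach() for h in hidden)
    q_values, hidden = model(obs_seq[:, burn_in_steps:], hidden)
    return q_values, hidden
```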
