
Conversation

@asokraju

Hi,

there is a bug in policy_gradient_reinforce_tf2.py at line 39:

loss = network.train_on_batch(states, discounted_rewards)

To fix this I made two changes (a self-contained sketch follows the list):

  1. one-hot encode the actions:
     one_hot_encode = np.array([[1 if a==i else 0 for i in range(2)] for a in actions])
  2. pass the discounted rewards via the 'sample_weight' parameter of the 'categorical_crossentropy' loss function
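Here is a minimal sketch of the combined change. The network architecture and the batch data below are placeholder stand-ins for the objects in policy_gradient_reinforce_tf2.py (network, states, actions, discounted_rewards); only the one-hot encoding and the sample_weight argument to train_on_batch reflect the actual fix.

import numpy as np
import tensorflow as tf

# Placeholder 2-action softmax policy network (CartPole-v0 has 4-dim observations).
num_actions = 2
network = tf.keras.Sequential([
    tf.keras.layers.Dense(24, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(num_actions, activation="softmax"),
])
network.compile(optimizer="adam", loss="categorical_crossentropy")

# Placeholder episode batch: states, actions taken, and their discounted returns.
states = np.random.rand(5, 4).astype(np.float32)
actions = np.array([0, 1, 1, 0, 1])
discounted_rewards = np.array([3.0, 2.0, 1.5, 1.0, 0.5], dtype=np.float32)

# Change 1: one-hot encode the actions so the targets match the softmax output shape.
one_hot_encoded_actions = np.array(
    [[1 if a == i else 0 for i in range(num_actions)] for a in actions]
)

# Change 2: pass the discounted returns as sample_weight so each action's
# cross-entropy term is scaled by its return.
loss = network.train_on_batch(
    states,
    one_hot_encoded_actions,
    sample_weight=discounted_rewards,
)
print(loss)

With the one-hot targets, the sample-weighted categorical cross-entropy for each step is -G_t * log pi(a_t | s_t), so minimizing it performs the REINFORCE policy-gradient update.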

I think it also solves issues #26, #27, and #28.

I tested it with gym.make("CartPole-v0") and it converged within 2000 episodes!

Episode: 1961, Reward: 200.0, avg loss: 0.01056
Episode: 1962, Reward: 200.0, avg loss: 0.02165
Episode: 1963, Reward: 200.0, avg loss: -0.04293
Episode: 1964, Reward: 200.0, avg loss: -0.00953
Episode: 1965, Reward: 200.0, avg loss: 0.02787
Episode: 1966, Reward: 200.0, avg loss: 0.00205
Episode: 1967, Reward: 200.0, avg loss: 0.01984
Episode: 1968, Reward: 200.0, avg loss: 0.00307
Episode: 1969, Reward: 200.0, avg loss: -0.03621
Episode: 1970, Reward: 200.0, avg loss: -0.02112
Episode: 1971, Reward: 200.0, avg loss: -0.00132
Episode: 1972, Reward: 200.0, avg loss: 0.02377
Episode: 1973, Reward: 200.0, avg loss: 0.02295
Episode: 1974, Reward: 200.0, avg loss: -0.01884
Episode: 1975, Reward: 200.0, avg loss: 0.02013
Episode: 1976, Reward: 200.0, avg loss: 0.02265
Episode: 1977, Reward: 200.0, avg loss: 0.00097
Episode: 1978, Reward: 200.0, avg loss: -0.03959
Episode: 1979, Reward: 200.0, avg loss: 0.00527
Episode: 1980, Reward: 200.0, avg loss: 0.02360
Episode: 1981, Reward: 200.0, avg loss: 0.03568
Episode: 1982, Reward: 200.0, avg loss: 0.00684
Episode: 1983, Reward: 200.0, avg loss: 0.00912
Episode: 1984, Reward: 200.0, avg loss: -0.03238
Episode: 1985, Reward: 200.0, avg loss: 0.03891
Episode: 1986, Reward: 200.0, avg loss: 0.01156
Episode: 1987, Reward: 200.0, avg loss: 0.04099
Episode: 1988, Reward: 200.0, avg loss: -0.00574
Episode: 1989, Reward: 200.0, avg loss: 0.01317
Episode: 1990, Reward: 200.0, avg loss: 0.00885
Episode: 1991, Reward: 200.0, avg loss: 0.02338
Episode: 1992, Reward: 200.0, avg loss: 0.00069
Episode: 1993, Reward: 200.0, avg loss: 0.01195
Episode: 1994, Reward: 200.0, avg loss: 0.02862
Episode: 1995, Reward: 200.0, avg loss: -0.00214
Episode: 1996, Reward: 200.0, avg loss: 0.01396
Episode: 1997, Reward: 200.0, avg loss: -0.01529
Episode: 1998, Reward: 200.0, avg loss: 0.01859
Episode: 1999, Reward: 200.0, avg loss: 0.02944


@redszyft left a comment


target_actions is not defined anywhere.
I think you need to rename one_hot_encode.

