`learner.epsilon_greedy_search(...)` is used for training agents with different algorithms, including DQL in `dql_run`. However, `dql_exploit_run`, which takes the network from `dql_run` as its policy agent and an `eval_episode_count` parameter for the number of episodes, gives the impression that these runs evaluate the trained DQN. The only distinguishable difference between the two runs is that epsilon equals 0, which switches the agent to pure exploitation but does not stop training: during a run with `learner.epsilon_greedy_search`, `optimizer.step()` is executed on every step via the call to `learner.on_step(...)` in `agent_dql.py`.
Solution: I will include in a pull request the code I used for a proper evaluation (based on `learner.epsilon_greedy_search(...)`) and the generated pictures below.
The ToyCTF benchmark is also inaccurate: with a correct evaluation procedure, as with the chain network configuration, the agent does not reach the goal of 6 owned nodes after 200 training episodes.
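To make the concern concrete, here is a minimal, self-contained sketch of why setting epsilon to 0 does not stop learning. The classes below are toy stand-ins, not CyberBattleSim code; only the names `on_step`, `optimizer.step()` and `epsilon_greedy_search` mirror the ones discussed above.

```python
# Toy illustration (not CyberBattleSim code): epsilon only affects action
# selection, while on_step updates the Q-function on every transition.
import random
import torch
import torch.nn as nn


class ToyDQLPolicy:
    def __init__(self, n_actions: int = 4):
        self.q_net = nn.Linear(8, n_actions)          # stand-in for the DQN
        self.optimizer = torch.optim.Adam(self.q_net.parameters(), lr=1e-3)
        self.n_actions = n_actions

    def select_action(self, state: torch.Tensor, epsilon: float) -> int:
        # epsilon only controls explore vs. exploit ...
        if random.random() < epsilon:
            return random.randrange(self.n_actions)
        with torch.no_grad():
            return int(self.q_net(state).argmax())

    def on_step(self, state, action, reward, next_state):
        # ... while on_step performs a gradient step regardless of epsilon,
        # so an "exploit" run with epsilon=0 still calls optimizer.step().
        target = reward + 0.99 * self.q_net(next_state).max().detach()
        loss = (self.q_net(state)[action] - target) ** 2
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()


def epsilon_greedy_search(policy: ToyDQLPolicy, episode_count: int, epsilon: float):
    """Toy analogue of learner.epsilon_greedy_search: the policy keeps learning
    even with epsilon=0 because on_step is called unconditionally."""
    for _ in range(episode_count):
        state = torch.randn(8)
        for _ in range(10):                            # toy episode of 10 steps
            action = policy.select_action(state, epsilon)
            next_state, reward = torch.randn(8), random.random()
            policy.on_step(state, action, reward, next_state)   # <- still training
            state = next_state
```

In this toy analogue, running `epsilon_greedy_search(ToyDQLPolicy(), episode_count=5, epsilon=0.0)` still changes the network weights, which mirrors the behaviour described above for `dql_exploit_run`.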
That's a valid point, though I think it's fair game to assume the agent can continue to learn and adjust its policy during evaluation, as long as its state (Q-function) is reset at the beginning of each episode to what it was after the learning phase (which might not currently be the case and may need to be fixed).
Also, if we really want to handicap the agent and prevent it from learning during an episode, then I suggest we add a `freeze_learning: bool` parameter to the `DeepQLearnerPolicy` agent. If set to true, the function `update_q_function` becomes a no-op.
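For illustration only, a hedged sketch of what such a flag could look like, written against a toy policy class rather than the real `DeepQLearnerPolicy` (whose constructor and `update_q_function` signature are not reproduced here):

```python
# Toy stand-in, not the real DeepQLearnerPolicy: constructor arguments and the
# update_q_function signature are illustrative assumptions.
from typing import Any, Callable


class ToyDeepQLearnerPolicy:
    def __init__(self, q_update_fn: Callable[..., None], freeze_learning: bool = False):
        # freeze_learning=True turns every Q-function update into a no-op, so an
        # evaluation run can reuse the trained weights without modifying them.
        self._q_update_fn = q_update_fn      # whatever performs the actual gradient step
        self.freeze_learning = freeze_learning

    def update_q_function(self, *transition: Any) -> None:
        # In the real agent this path eventually calls optimizer.step(); here we
        # simply guard whichever update callable was supplied.
        if self.freeze_learning:
            return                           # evaluation must not change the weights
        self._q_update_fn(*transition)
```

To cover the point above about resetting state, the evaluation driver could additionally snapshot the Q-network's `state_dict()` after training and reload it at the start of each evaluation episode.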
Issue forked from #87 by @kvas7andy
`learner.epsilon_greedy_search(...)` is used for training agents with different algorithms, including DQL in `dql_run`. However, `dql_exploit_run`, which takes the network from `dql_run` as its policy agent and an `eval_episode_count` parameter for the number of episodes, gives the impression that these runs evaluate the trained DQN. The only distinguishable difference between the two runs is that epsilon equals 0, which switches the agent to pure exploitation but does not stop training: during a run with `learner.epsilon_greedy_search`, `optimizer.step()` is executed on every step via the call to `learner.on_step(...)` in `agent_dql.py`.

Because `dql_exploit_run` internally still uses `learner.on_step(...)`, it yields much better results (figure 2): the optimization process keeps learning from the agent's ongoing experience. We can correct this inaccurate evaluation and still reach the goal 100% of the time (figure 3) by training on 200 episodes with the `learner.on_step()` call commented out during evaluation. This freezes the trained network and stops optimization during evaluation; owning the whole network then requires a larger number of learning episodes. It also means that with 200 episodes it is feasible to learn the optimal attack path in the chain network configuration. Lastly, figure 4 compares these runs: with correct evaluation over 20 episodes, the agents reach 6000+ and 120+ cumulative reward for 200 and 50 training episodes respectively.
Figure 1: (after PR) no optimizer during evaluation, 20 trained episodes, 20 evaluation episodes
Figure 2: (before & after PR) `dql_exploit_run` with optimizer during evaluation, 20 trained episodes, 5 evaluation episodes
Figure 3: (after PR) no optimizer during evaluation, 200 trained episodes, 20 evaluation episodes
Figure 4: (after PR) comparison of evaluation for network trained on 200 and 20 episodes, chain network configuration
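For concreteness, a minimal sketch (not the actual PR code) of an optimizer-free evaluation loop in the spirit of the commented-out `learner.on_step()`: the trained policy is only queried, never updated. The `policy` and `env` interfaces below are simplified, hypothetical stand-ins.

```python
# Illustrative evaluation loop: names and interfaces are assumptions, not the
# CyberBattleSim API. policy.select_action(state, epsilon) picks an action and
# env.step(action) returns (next_state, reward, done).
import torch


def evaluate(policy, env, eval_episode_count: int = 20):
    episode_rewards = []
    for _ in range(eval_episode_count):
        state = env.reset()
        done, total_reward = False, 0.0
        while not done:
            with torch.no_grad():             # no gradients, hence no optimizer.step()
                action = policy.select_action(state, epsilon=0.0)   # pure exploitation
            state, reward, done = env.step(action)
            total_reward += reward
            # deliberately no policy.on_step(...) here, so the network being
            # evaluated is exactly the one produced by the training episodes
        episode_rewards.append(total_reward)
    return episode_rewards
```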
The ToyCTF benchmark is also inaccurate: with a correct evaluation procedure, as with the chain network configuration, the agent does not reach the goal of 6 owned nodes after 200 training episodes.