System requirements: Python >= 3.6
For Linux/macOS:
sudo apt install -y libglu1-mesa-dev libgl1-mesa-dev libosmesa6-dev xvfb ffmpeg curl patchelf libglfw3 libglfw3-dev cmake zlib1g zlib1g-dev swig
git clone https://github.com/bhatiaabhinav/RL-v2.git
cd RL-v2
python3 -m venv env
source env/bin/activate
pip install --upgrade pip
pip install wheel
pip install -r requirements_ubuntu.txt
pip install -e .
mkdir logs

Alternatively, set the environment variable RL_LOGDIR to specify the logs directory.
For Windows: (Many of the things below can be more conveniently installed using the Chocolatey package manager for Windows)
- Install git (with Unix tools).
- Install Visual Studio Community Edition 2019 with MS Build Tools 2015, 2017, 2019.
- Install swig.
- Install ffmpeg.
- Install CUDA and the corresponding version of PyTorch (after activating the virtual env): https://pytorch.org/.
Then:
git clone https://github.com/bhatiaabhinav/RL-v2.git
cd RL-v2
python3 -m venv env
./env/Scripts/activate
pip install --upgrade pip
pip install wheel
pip install -r requirements_windows.txt
pip install -e .
mkdir logs

Activate the virtual env:
cd RL-v2
source env/bin/activate

or for Windows:
./env/Scripts/activate

Then:
python -m RL env_id algo_id num_steps_to_run --algo_suffix=my_run_name --tags tag1 tag2

- To run an experiment for a fixed number of episodes instead of a fixed number of steps, set the `--num_episodes_to_run` parameter accordingly and set `num_steps_to_run` to a very high value (say 1000000000).
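For example, a hypothetical run capped at 500 episodes (the environment, algorithm, and suffix below are arbitrary choices, not a tested configuration) could look like:

python -m RL CartPole-v0 DQN 1000000000 --algo_suffix=500episodes --num_episodes_to_run=500 --min_explore_steps=10000 --no_render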
A Weights & Biases account is needed to run the module. It is a tool to track, log, graph, and visualize machine learning projects. When the module is run for the first time, it will ask for the account's authorization key. Once that is set, all future runs will be logged to that account.
A run is recorded under the name {algo_id}_{algo_suffix} in the {env_name} project of the wandb account. When a run starts, a link to view it in the wandb dashboard is printed at the beginning.
Note for Windows: If you face problems with wandb permissions, try running in an admin command prompt or admin PowerShell.
The logs for an experiment are stored in directory: {logs folder}/{env name}/{algo_id}_{algo_suffix}.
- Attempting to re-run an experiment with the same <env_name, algo_id, algo_suffix> combo will result in the module refusing to run, to prevent overwriting previous logs. Specify the `--overwrite` flag to force the re-run and overwrite the logs.
- A different logs folder can be specified (instead of the default 'logs' folder in the working directory, or $RL_LOGDIR if that environment variable is set) using the `--rl_logdir` parameter.
- To run in debug mode (i.e. record all logs created using logger.debug('') in the log file), set the `--debug` flag. By default, INFO and above level logs are recorded. To disable INFO level logs, set the `--no_logs` flag.
By default, the env's rendering is turned on. To turn it off, specify the `--no_render` flag.
If the OpenAI Gym monitor is used to record videos of episodes, the videos are saved in the run's logs directory and are also available for visualization in the wandb dashboard.
Specify the `--no_monitor` flag to disable the monitor. If the monitor is used, the episode interval for recording videos can be set with the `--monitor_video_freq` parameter (default 100).
By default, the module uses Nvidia CUDA if a GPU is available. To disable GPU use, set the `--no_gpu` flag.
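Putting a few of these logging and rendering options together, an illustrative (not prescriptive) invocation that logs to a custom directory, overwrites a previous run of the same name, enables debug logs, records a video every 50 episodes, and stays off the GPU might be:

python -m RL CartPole-v0 DQN 20000 --algo_suffix=cpu_debug --rl_logdir=mylogs --overwrite --debug --monitor_video_freq=50 --min_explore_steps=10000 --no_render --no_gpu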
- DQN Vanilla
- Double DQN. Specify the `--double_dqn` flag.
- Dueling DQN. Specify the `--dueling_dqn` flag.
- N-Step DQN. Specify the `--nsteps` parameter (default=1).
- To turn on soft-copying of Q network parameters to the target network (like in DDPG), specify e.g. `--target_q_polyak=0.999` (default=0) and also set `--target_q_freq=1` (default 10000) so as to copy the parameters every training step.
- Soft (Boltzmann) policy. Set the temperature using `--dqn_ptemp` (default=0).
- `--ep` (default=0.1). Value of epsilon in epsilon-greedy action selection.
- `--ep_anneal_steps` (default=1000000). Number of steps over which epsilon is annealed from 1 to `ep`. The annealing begins after the `min_explore_steps` phase. `min_explore_steps` is explained below.
- By default, Huber loss is used for the Q network. Specify `--dqn_mse_loss` to change to MSE loss.
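As an illustrative sketch (untuned values, arbitrary environment), several of these DQN options can be combined in a single run:

python -m RL CartPole-v0 DQN 100000 --algo_suffix=double_dueling_3step --seed=0 --double_dqn --dueling_dqn --nsteps=3 --target_q_polyak=0.999 --target_q_freq=1 --min_explore_steps=10000 --no_render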
- DDPG Vanilla (but without OU noise exploration)
- Adaptive Param Noise Exploration. Set the target deviation of the noisy policy using `--ddpg_noise` (default 0.2).
- N-Step DDPG. Specify the `--nsteps` parameter (default=1).
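A sketch of a DDPG run on a continuous-control task, assuming the algo_id string is DDPG (the hyperparameter values here are illustrative, not tuned):

python -m RL Pendulum-v0 DDPG 1000000 --algo_suffix=noise0.3_3step --seed=0 --ddpg_noise=0.3 --nsteps=3 --min_explore_steps=10000 --no_render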
- SAC Vanilla
- Adaptive alpha to maintain constant entropy (turned on by default). Specify the initial alpha with `--sac_alpha` (default 0.2). To turn off adaptive alpha, specify the `--fix_alpha` flag.
- N-Step SAC. Specify the `--nsteps` parameter (default=1).

Note: The DDPG and SAC algorithms soft-copy Q net params to the target net.
- SACDiscrete Vanilla
- Adaptive alpha to maintain constant entropy (turned on by default). Specify the initial alpha with `--sac_alpha` (default 0.2). To turn off adaptive alpha, specify the `--fix_alpha` flag. In adaptive alpha, alpha never falls below the `sac_alpha` value, i.e. it serves as the minimum value of alpha.
- By default, Huber loss is used for the Q network. Specify `--dqn_mse_loss` to change to MSE loss.
- N-Step SAC. Specify the `--nsteps` parameter (default=1).

Note: The DDPG and SAC algorithms soft-copy Q net params to the target net.
- Network: Up to 3 convolutional layers can be specified, followed by any number of hidden layers. All conv layers are automatically ignored when the input is not an image.
  - `--conv1` (default= 32 8 4 0). Parameters of the first convolution layer. Specify like this: `--conv1 channels kernel_size stride padding`. Specify `--conv1 0` to skip this layer.
  - `--conv2` (default= 64 4 2 0). Specification format same as conv1. Specify `--conv2 0` to skip this layer.
  - `--conv3` (default= 64 3 1 0). Specification format same as conv1. Specify `--conv3 0` to skip this layer.
  - `--hiddens` (default= 512). Specify hidden layers like: `--hiddens h1 h2 h3`. E.g. `--hiddens 512 256 64 32` creates 4 hidden layers with the respective numbers of nodes. To specify no hidden layers, pass `--hiddens`, i.e. the argument name followed by only a whitespace.
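For example, on a non-image environment (where the conv layers are ignored anyway), the network could hypothetically be set to two hidden layers of 128 and 64 units, or to no hidden layers at all by passing `--hiddens` with no values (placed last here, since it takes no value):

python -m RL CartPole-v0 DQN 20000 --algo_suffix=mlp_128_64 --min_explore_steps=10000 --no_render --hiddens 128 64

python -m RL CartPole-v0 DQN 20000 --algo_suffix=linear --min_explore_steps=10000 --no_render --hiddens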
- `--gamma` (default=0.99). Discount factor for training.
- `--exp_buff_len` (default=1M). Experience buffer length.
- `--no_ignore_done_on_timlimit`. By default, the experience buffer records `done` as false when an episode is terminated artificially due to the timelimit wrapper (because the episode did not really end, and recording done=True in such cases would cause difficulty in learning the value function, since the Markov property of the preceding state would break). Specify this flag to disable this functionality.
- `--min_explore_steps` (default=1000000). At the beginning of training, execute a random policy for this many steps.
- `--exploit_freq` (default=None). Play the learnt policy (so far) for an episode without exploration every exploit_freq episodes.
- `--train_freq` (default=4). Perform a training step every this many episode steps.
- `--sgd_steps` (default=1). Number of SGD updates per training step.
- `--lr` (default=1e-4). Learning rate for the Adam optimizer.
- `--mb_size` (default=32). Minibatch size.
- `--td_clip` (default=None). Clip temporal difference errors to this magnitude.
- `--grad_clip` (default=None). Clip gradients (by norm) to this magnitude.
- `--seed`. None by default, leading to non-deterministic behavior. If set to some value, the following get seeded with the specified value: Python's random module, torch, and numpy.
- `--reward_scaling` (default=1). Scales rewards for training. The logs & graphs also record scaled returns, unless the `--record_unscaled` flag is specified.
- `--record_discounted` flag. When set, causes logs & graphs to record discounted returns per episode, instead of the sum of rewards per episode.
- `--model_save_freq` (default=1000000). The model is saved every this many steps to the checkpoints directory.
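For instance, a run that overrides a few of these general training hyperparameters (purely illustrative values, not a recommended configuration):

python -m RL CartPole-v0 DQN 50000 --algo_suffix=hp_demo --seed=0 --gamma=0.98 --lr=3e-4 --mb_size=64 --train_freq=1 --exp_buff_len=50000 --min_explore_steps=5000 --no_render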
- For non-Atari environments (both image-based and non-image-based):
  - Framestack, e.g. `--framestack=4` (default=1)
  - Frameskip, e.g. `--frameskip=4` (default=1)
  - Artificial timelimit, e.g. `--artificial_timelimit=500` (default None). Uses gym's built-in TimeLimit wrapper to force termination of episodes after this many steps.
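As an example (illustrative values only), a non-Atari run combining frame stacking with an artificial timelimit might look like:

python -m RL LunarLander-v2 DQN 200000 --algo_suffix=stack4_T500 --framestack=4 --artificial_timelimit=500 --min_explore_steps=10000 --no_render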
- Atari specific:
  - Framestack, e.g. `--atari_framestack=4` (default=4)
  - Frameskip, e.g. `--atari_frameskip=4` (default=4)
  - Max number of noops at the beginning of an episode, e.g. `--atari_noop_max=30` (default=30)
  - Reward clipping flag: `--atari_clip_rewards`
Framestack is always applied on top of frameskip. E.g. if frameskip is 4 and framestack is 3, then frames 0, 4, 8 are stacked together to form the first step's observation; then frames 4, 8, 12 are stacked together to form the second step's observation; and so on.
For Atari, use only environments:
- ending in `NoFrameskip-v4`, e.g. `BreakoutNoFrameskip-v4` or `PongNoFrameskip-v4`, to play from pixels,
- or ending in `-ramNoFrameskip-v4`, e.g. `Breakout-ramNoFrameskip-v4`, to play from the RAM state.

Any other environment will be treated as a non-Atari environment.
Another note: `--atari_framestack` is ignored for RAM-based Atari environments. To force frame stacking, use the general `--framestack` argument.
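For instance, a RAM-based Atari run that forces frame stacking via the general flag could hypothetically look like (untuned):

python -m RL Breakout-ramNoFrameskip-v4 DQN 10000000 --algo_suffix=ram_stack4 --seed=0 --framestack=4 --hiddens 512 256 --no_render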
python -m RL CartPole-v0 DQN 20000 --algo_suffix=quicktest --seed=0 --hiddens 64 32 --train_freq=1 --target_q_freq=2000 --nsteps=3 --min_explore_steps=10000 --ep_anneal_steps=10000 --ep=0.01 --no_render

or

python -m RL BreakoutNoFrameskip-v4 DQN 10000000 --algo_suffix=mnih --seed=0 --conv1 32 8 4 0 --conv2 64 4 2 0 --conv3 64 3 1 0 --hiddens 512 --no_render

or

python -m RL BreakoutNoFrameskip-v4 DQN 10000000 --algo_suffix=3step_mnih_small --seed=0 --conv1 16 8 4 0 --conv2 32 4 2 0 --conv3 0 --hiddens 256 --target_q_freq=8000 --nsteps=3 --min_explore_steps=100000 --ep_anneal_steps=100000 --exp_buff_len=100000 --no_render

or

python -m RL BreakoutNoFrameskip-v4 DQN 10000000 --algo_suffix=mnih_big --seed=0 --conv1 64 6 2 0 --conv2 64 6 2 2 --conv3 64 6 2 2 --hiddens 1024 --no_render

python -m RL Pendulum-v0 SAC 1000000 --algo_suffix=quicktest_gc1 --seed=0 --hiddens 64 32 --train_freq=1 --min_explore_steps=10000 --grad_clip=1 --no_render

python -m RL BipedalWalker-v3 SAC 1000000 --algo_suffix=T200_gc1 --seed=0 --hiddens 64 32 --train_freq=1 --min_explore_steps=10000 --grad_clip=1 --artificial_timelimit=200 --no_render

python -m RL LunarLander-v2 SACDiscrete 200000 --algo_suffix=T500_3step --artificial_timelimit=500 --seed=0 --hiddens 64 32 --train_freq=1 --nsteps=3 --min_explore_steps=10000 --dqn_mse_loss --no_render --monitor_video_freq=20

To get list of specifiable hyperparams:
python -m RL -h