A framework for online learning in the constrained linear contextual bandit setting. The code is structured as follows:
- Core functionality is implemented in scripts prefixed by an underscore:
  - `_bandit_learning.py` - defines a variety of "safe" and "unsafe" bandit learning algorithms, as well as the `evaluate()` function for running algorithm-environment interactions and recording the results;
  - `_BanditEnv.py` - defines the `BanditEnv` class for representing contextual bandit problems, and includes getter functions to construct bandit problems with different properties;
  - `_utils.py` - defines short, reusable functions like `linear_regression()` (a sketch of such a helper follows this list);
  - `_visualize_results.py` - tools for producing plots and for serializing/deserializing experiment data.
- Experiment configuration is done by two kinds of files in the `experiments` folder (see the example configuration after this list):
  - Learning algorithm configuration (scripts prefixed by `algorithms_`)
  - Bandit environment configuration (all other scripts)
- To run an experiment, execute `experiment_driver.py`, which will use the specified experiment configurations and call `experiment_worker.py` to make repeated calls to `_bandit_learning.evaluate()` (a schematic of this control flow also follows the list).
- One-off scripts with no dependencies live in the `standalone` folder.
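As an illustration of the kind of helper `_utils.py` contains, `linear_regression()` is presumably something along these lines. This is a sketch under an assumed signature, not the repo's actual implementation:

```python
import numpy as np

def linear_regression(x_mat, y, lambda_reg=1.0):
    """Ridge-regularized least squares: solve (X^T X + lambda*I) beta = X^T y.

    A sketch of what _utils.linear_regression() might compute; the actual
    signature and regularization in the repo may differ.
    """
    d = x_mat.shape[1]
    gram = x_mat.T @ x_mat + lambda_reg * np.eye(d)
    return np.linalg.solve(gram, x_mat.T @ y)
```

An algorithm configuration script might then look something like the sketch below. Every name in it (`alg_dict`, `alg_eps_greedy`, `alg_safe_ts`, and their keyword arguments) is a hypothetical placeholder; consult the actual `algorithms_`-prefixed scripts in `experiments` for the real structure:

```python
# Hypothetical experiments/algorithms_example.py. The algorithm names and
# keyword arguments below are illustrative placeholders, not the repo's API.
from functools import partial

import _bandit_learning as bandit_learning

alg_dict = {
    # label shown in plots -> algorithm callable passed to evaluate()
    "Unsafe e-greedy": partial(bandit_learning.alg_eps_greedy, epsilon=0.1),
    "Safe TS": partial(bandit_learning.alg_safe_ts, delta=0.05),
}
```

Finally, the driver/worker loop amounts to roughly the following schematic; `make_env()`, `num_runs`, and `evaluate()`'s argument list are assumptions, not the actual interface:

```python
# Schematic of what experiment_driver.py / experiment_worker.py do together;
# evaluate()'s argument list and make_env() are assumptions, not the real API.
import _bandit_learning as bandit_learning

def run_experiment(alg_dict, env_config, num_runs=300):
    """Run each configured algorithm num_runs times and collect the
    records produced by _bandit_learning.evaluate()."""
    results = {}
    for label, alg in alg_dict.items():
        results[label] = [
            bandit_learning.evaluate(alg, env_config.make_env())
            for _ in range(num_runs)
        ]
    return results
```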
The framework was used for simulation experiments in chapter 4 of my thesis. I described the problem setting as:
> ... a constrained reinforcement learning problem, where in addition to reward maximization, a decision maker must also select actions according to a constraint on their “safety.” Constraint satisfaction, like the underlying reward signal, is estimated from noisy data and thus requires careful handling of uncertainty.
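For readers unfamiliar with the setting, a standard formalization of a constrained linear contextual bandit looks roughly as follows; the notation (θ*, β*, τ, δ) is illustrative, not necessarily the thesis's:

```latex
% A standard constrained linear contextual bandit formulation (illustrative
% notation; see chapter 4 of the thesis for the precise setting).
% At each round t the learner observes feature vectors x_{t,a}, picks an
% action a_t, and receives noisy reward and safety observations:
\[
  r_t = x_{t,a_t}^\top \theta^* + \eta_t,
  \qquad
  s_t = x_{t,a_t}^\top \beta^* + \epsilon_t .
\]
% The goal is to maximize cumulative expected reward while keeping each
% chosen action safe with high probability, even though \beta^* (like
% \theta^*) is only known through noisy data:
\[
  \max_{a_1, \dots, a_T} \;
  \mathbb{E}\Big[ \sum_{t=1}^{T} x_{t,a_t}^\top \theta^* \Big]
  \quad \text{s.t.} \quad
  \Pr\big( x_{t,a_t}^\top \beta^* \le \tau \big) \ge 1 - \delta
  \;\; \text{for all } t .
\]
```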