Description
Add a builtin component for launching HPO (hyper-parameter optimization) jobs. At a high level, something akin to:
# for grid search
$ torchx run -s kubernetes hpo.grid_search --paramspacefile=~/parameters.json --component dist.ddp
# for bayesian search
$ torchx run -s kubernetes hpo.bayesian ...
In both cases we use the Ax/TorchX integration to run the HPO driver job (see the motivation section below for details).
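To make the shape of the proposal concrete, here is a purely hypothetical sketch of what the companion hpo.grid_search component could look like. The component signature, the torchx.apps.hpo.grid_search driver module, and the params_file convention are all assumptions, not existing TorchX APIs:
# Hypothetical component sketch -- none of this exists in TorchX today.
import torchx.specs as specs


def grid_search(
    params_file: str,
    objective_component: str = "dist.ddp",
    image: str = "ghcr.io/pytorch/torchx:latest",
) -> specs.AppDef:
    """Launches a grid-search HPO driver as a single-role, single-replica app."""
    return specs.AppDef(
        name="hpo-grid-search",
        roles=[
            specs.Role(
                name="hpo-driver",
                image=image,
                entrypoint="python",
                args=[
                    "-m",
                    "torchx.apps.hpo.grid_search",  # hypothetical driver module
                    "--params_file",
                    params_file,
                    "--component",
                    objective_component,
                ],
            )
        ],
    )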
Motivation/Background
TorchX already integrates with Ax, which supports both Bayesian and grid search HPO. Some definitions before we get started:
- Ax: Experiment - (docs) Defines the HPO search space and holds the optimizer state. Vends out the next set of parameters to search based on the observed results (relevant for Bayesian and Bandit optimizations, not so much for grid search).
- Ax: Trials - (docs) A step in an experiment, aka a (training) job that runs with a specific set of hyper-parameters as vended out by the optimizer in the experiment.
- Ax: Runner - (docs) Responsible for launching trials.
Ax/TorchX integration is done at the Runner level. We implemented an ax/TorchXRunner that implements Ax's Runner interface (do not confuse this with the TorchX runner; TorchX itself defines a runner concept). The ax/TorchXRunner runs the Ax Trials using TorchX.
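To illustrate the division of labor, here is a highly simplified, hypothetical sketch of what a TorchX-backed Ax Runner does conceptually. This is not the actual ax/TorchXRunner implementation (result tracking, status polling, const params, etc. are omitted); the class name below is made up:
# Conceptual sketch only -- the real implementation is ax.runners.torchx.TorchXRunner.
from typing import Any, Dict

from ax.core.runner import Runner
from ax.core.trial import Trial
from torchx.components import utils
from torchx.runner import get_runner


class SimplifiedTorchXRunner(Runner):  # hypothetical name
    def run(self, trial: Trial) -> Dict[str, Any]:
        # Build the AppDef for this trial from the hyper-parameters that the
        # Ax optimizer vended out (here: the booth component's x1 and x2).
        params = trial.arm.parameters
        app = utils.booth(x1=params["x1"], x2=params["x2"])

        # Submit the trial as a TorchX job; the returned metadata lets a
        # companion Ax Metric locate this trial's results later.
        app_handle = get_runner().run(app, scheduler="local_cwd")
        return {"app_handle": app_handle}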
The ax/TorchXRunnerTest serves as a full end-to-end example of how everything works. In summary, the test runs a Bayesian HPO to minimize the "booth" function. Note that in practice this function is replaced by your "trainer". The main module that computes the booth function given the parameters x_1 and x_2 as inputs is defined in torchx.apps.utils.booth.
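For reference, the booth function is the standard two-dimensional optimization test function with its global minimum at (1, 3). A minimal sketch of the function itself (the actual booth app additionally writes the result out so that it can be read back as the "booth_eval" metric):
def booth(x1: float, x2: float) -> float:
    # Booth function: global minimum of 0 at (x1, x2) == (1, 3)
    return (x1 + 2 * x2 - 7) ** 2 + (2 * x1 + x2 - 5) ** 2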
The abridged code looks something like this:
from typing import List

from ax.core.experiment import Experiment
from ax.core.objective import Objective
from ax.core.optimization_config import OptimizationConfig
from ax.core.parameter import Parameter, ParameterType, RangeParameter
from ax.core.search_space import SearchSpace
from ax.metrics.torchx import TorchXMetric
from ax.modelbridge.dispatch_utils import choose_generation_strategy
from ax.runners.torchx import TorchXRunner
from ax.service.scheduler import Scheduler, SchedulerOptions
from torchx.components import utils

# the hyper-parameter search space: x1, x2 in [-10, 10]
parameters: List[Parameter] = [
    RangeParameter(
        name="x1",
        lower=-10.0,
        upper=10.0,
        parameter_type=ParameterType.FLOAT,
    ),
    RangeParameter(
        name="x2",
        lower=-10.0,
        upper=10.0,
        parameter_type=ParameterType.FLOAT,
    ),
]

experiment = Experiment(
    name="torchx_booth_sequential_demo",
    search_space=SearchSpace(parameters=parameters),
    optimization_config=OptimizationConfig(
        objective=Objective(
            metric=TorchXMetric(name="booth_eval"),
            minimize=True,
        ),
    ),
    # launches each trial as a TorchX job running the booth component
    runner=TorchXRunner(
        tracker_base="/tmp/",  # base dir where trial results are tracked
        component=utils.booth,
        scheduler="local_cwd",
        cfg={"prepend_cwd": True},
    ),
)

scheduler = Scheduler(
    experiment=experiment,
    generation_strategy=choose_generation_strategy(search_space=experiment.search_space),
    options=SchedulerOptions(),
)

for _ in range(3):
    scheduler.run_n_trials(max_trials=2)
    scheduler.report_results()
Detailed Proposal
The task here is essentially to create pre-packaged applications for the code above. We can define two types of HPO apps by the "strategy" used:
- hpo.grid_search
- hpo.bayesian
Each application will come with a companion "component" (e.g. hpo.grid_search and hpo.bayesian). The applications should be designed to take as input:
- parameter space
- what the objective function is (e.g. trainer)
- torchx cfgs (e.g. scheduler, scheduler runcfg, etc)
- ax experiment configs
The challenge is to "parameterize" the application correctly and in a way that lets the user sanely pass these arguments from the CLI. For complex parameters such as the parameter space, one might consider taking a file in a specific format rather than conjuring up a complex string encoding to pass as CLI input.
For instance, for the 20 x 20 parameter space over x_1 and x_2 in the example above, rather than taking the parameter space as:
$ torchx run hpo.bayesian --parameter_space x_1=-10:10,x_2=-10:10
one can take it as a well-defined Python parameter file:
# params.py
# just defines the parameters using the regular Ax APIs
from typing import List

from ax.core.parameter import Parameter, ParameterType, RangeParameter

parameters: List[Parameter] = [
    RangeParameter(
        name="x1",
        lower=-10.0,
        upper=10.0,
        parameter_type=ParameterType.FLOAT,
    ),
    RangeParameter(
        name="x2",
        lower=-10.0,
        upper=10.0,
        parameter_type=ParameterType.FLOAT,
    ),
]
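A hypothetical sketch of how the HPO driver application could consume such a file follows. The --params_file flag, the load_search_space helper, and the convention that the file defines a top-level parameters list are all assumptions for illustration, not existing TorchX or Ax features:
# Hypothetical loader sketch -- not an existing TorchX/Ax feature.
import runpy
from typing import List

from ax.core.parameter import Parameter
from ax.core.search_space import SearchSpace


def load_search_space(params_file: str) -> SearchSpace:
    # Execute the user-supplied params.py and pick up the `parameters`
    # list that it defines at module level.
    module_globals = runpy.run_path(params_file)
    parameters: List[Parameter] = module_globals["parameters"]
    return SearchSpace(parameters=parameters)


# e.g. inside the hpo.bayesian driver:
#   search_space = load_search_space(args.params_file)
#   experiment = Experiment(..., search_space=search_space, ...)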
Alternatives
Document how users can write their own HPO application and instruct them to run it with torchx run utils.python, since the HPO driver application only needs to run from a single process.
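For example, assuming a user-written driver script (my_hpo_driver.py is a hypothetical name), something like:
$ torchx run -s kubernetes utils.python --script my_hpo_driver.py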
Additional context/links
See hyperlinks above.