A yaml file which contains a list of standup and run parameters of interest, termed *factors*, and, for each factor, a list of values of interest, termed *levels*. Sets of values, one per factor, produce combinations termed *treatments*. These concepts and nomenclature follow the "Design of Experiments" (DOE) approach, which allows a systematic and reproducible investigation of how different parameters affect the overall performance of a stack.
While the triplet <scenario>,<harness>,<(workload) profile> contains all the information required for llm-d-benchmark to carry out a standup->run->teardown lifecycle, comparing and validating the performance of different stacks requires sweeping a large number of llm-d parameters. Hence the need for an automated mechanism to loop through this (potentially) large parameter space.
An experiment file has to be manually crafted as yaml. Once crafted, it can be used by the e2e.sh executable. Its use is controlled by the following parameters:
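To illustrate the DOE nomenclature (a hypothetical sketch, not part of llm-d-benchmark itself): a full-factorial design would enumerate every combination of levels across factors, while an experiment file typically lists an explicit subset of those combinations as its treatments. The factor and level names below are made up for the example:

```python
from itertools import product

# Two hypothetical factors, each with its levels of interest.
factors = ["max-concurrency", "num-prompts"]
levels = {
    "max-concurrency": ["1", "4"],
    "num-prompts": ["10", "40"],
}

# A full-factorial design: the Cartesian product of all levels.
treatments = [",".join(combo) for combo in product(*(levels[f] for f in factors))]
print(treatments)  # → ['1,10', '1,40', '4,10', '4,40']
```

In the examples in this document, the `treatments` lists pick out only a subset of this full cross product (e.g., the "diagonal" pairs), which keeps the number of standup/run cycles manageable.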
Note
./e2e.sh (the executable which combines ./setup/standup.sh, run.sh and setup/teardown.sh) is the only executable which accepts an experiment file.
| Variable | Meaning | Note |
|---|---|---|
| LLMDBENCH_HARNESS_EXPERIMENT_TREATMENTS | yaml file containing an experiment description | Can be overridden with CLI parameter -e/--experiments |
Tip
In case the full path is omitted for the experiment file (either when setting LLMDBENCH_HARNESS_EXPERIMENT_TREATMENTS or when using the CLI parameter -e/--experiments), it is assumed that the file exists inside the experiments folder.
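The lookup rule above can be sketched as follows (a hypothetical illustration, not the actual e2e.sh logic; the function name and the `experiments` default are assumptions):

```python
from pathlib import Path

def resolve_experiment_file(name: str, experiments_dir: str = "experiments") -> Path:
    """Sketch of the rule: a bare file name (no path components) is
    assumed to live inside the experiments folder; anything carrying
    a path is used as given."""
    p = Path(name)
    if p.is_absolute() or len(p.parts) > 1:
        return p
    return Path(experiments_dir) / p

print(resolve_experiment_file("disaggregated_vs_llmd"))   # → experiments/disaggregated_vs_llmd
print(resolve_experiment_file("/tmp/my_experiment.yaml"))  # → /tmp/my_experiment.yaml
```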
- Compare `standalone` vllm with `llm-d` in a stack with a variable number of `prefill` and `decode` pods. Each time a new combination is deployed, run a workload profile with varying `max-concurrency` and `num-prompts`
Important
The harness (vllm-benchmark) and (workload) profile (random_concurrent) are not defined here, but in the scenario
```yaml
setup:
  factors:
    - LLMDBENCH_DEPLOY_METHODS
    - LLMDBENCH_VLLM_COMMON_REPLICAS
    - LLMDBENCH_VLLM_COMMON_ACCELERATOR_NR
    - LLMDBENCH_VLLM_MODELSERVICE_PREFILL_REPLICAS
    - LLMDBENCH_VLLM_MODELSERVICE_PREFILL_ACCELERATOR_NR
    - LLMDBENCH_VLLM_MODELSERVICE_DECODE_REPLICAS
    - LLMDBENCH_VLLM_MODELSERVICE_DECODE_ACCELERATOR_NR
  levels:
    LLMDBENCH_VLLM_COMMON_REPLICAS: "2,4"
    LLMDBENCH_VLLM_COMMON_ACCELERATOR_NR: "8"
    LLMDBENCH_VLLM_MODELSERVICE_PREFILL_REPLICAS: "2,4,6,8"
    LLMDBENCH_VLLM_MODELSERVICE_PREFILL_ACCELERATOR_NR: "1,2"
    LLMDBENCH_VLLM_MODELSERVICE_DECODE_REPLICAS: "1,2,4"
    LLMDBENCH_VLLM_MODELSERVICE_DECODE_ACCELERATOR_NR: "2,4,8"
  treatments:
    - "modelservice,NA,NA,6,2,1,4"
    - "modelservice,NA,NA,4,2,1,8"
    - "modelservice,NA,NA,8,1,1,8"
    - "modelservice,NA,NA,4,2,2,4"
    - "modelservice,NA,NA,4,2,4,2"
    - "modelservice,NA,NA,2,2,4,4"
    - "standalone,2,8,NA,NA,NA,NA"
    - "standalone,4,8,NA,NA,NA,NA"
run:
  factors:
    - max-concurrency
    - num-prompts
  levels:
    max-concurrency: "1,4,8,16,32,64,128,256,512,1024"
    num-prompts: "10,40,80,160,320,640,1280,2560,5120,10240"
  treatments:
    - "1,10"
    - "4,40"
    - "8,80"
    - "16,160"
    - "32,320"
    - "64,640"
    - "128,1280"
    - "256,2560"
    - "512,5120"
    - "1024,10240"
```
Note
The NA ("Not Applicable") value makes explicit which standup parameters are not used by a particular method (e.g., LLMDBENCH_VLLM_COMMON_REPLICAS is not used when standing up an llm-d stack via modelservice).
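Conceptually, each treatment string maps positionally onto the `factors` list, with NA entries skipped. This can be sketched as follows (a hypothetical illustration, not the actual e2e.sh parsing code; the trimmed factor list is for brevity):

```python
# A trimmed-down factor list, for illustration only.
factors = [
    "LLMDBENCH_DEPLOY_METHODS",
    "LLMDBENCH_VLLM_COMMON_REPLICAS",
    "LLMDBENCH_VLLM_MODELSERVICE_PREFILL_REPLICAS",
]
treatment = "modelservice,NA,6"

# Pair each value with its factor by position, dropping NA entries.
assignments = {
    factor: value
    for factor, value in zip(factors, treatment.split(","))
    if value != "NA"
}
print(assignments)
# → {'LLMDBENCH_DEPLOY_METHODS': 'modelservice',
#    'LLMDBENCH_VLLM_MODELSERVICE_PREFILL_REPLICAS': '6'}
```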
This particular example can be used with the following command:

```
./e2e.sh --scenario disaggregated_vs_llmd --experiments disaggregated_vs_llmd
```
- Compare different parameters for GAIE (Gateway API Inference Extension), using a fixed set of `decode` pods. Once deployed, run a workload profile varying `num_groups` and `system_prompt_len`
Important
The harness (inference-perf) and (workload) profile (shared_prefix_synthetic) are not defined here, but in the scenario
```yaml
setup:
  factors:
    - LLMDBENCH_VLLM_MODELSERVICE_GAIE_PRESETS
  levels:
    LLMDBENCH_VLLM_MODELSERVICE_GAIE_PRESETS: "default,prefix-cache-estimate-config,prefix-cache-tracking-config"
  treatments:
    - "default"
    - "prefix-cache-estimate-config"
    - "prefix-cache-tracking-config"
run:
  factors:
    - num_groups
    - system_prompt_len
  levels:
    num_groups: "40,60"
    system_prompt_len: "8000,5000,1000"
  treatments:
    - "40,8000"
    - "60,5000"
    - "60,1000"
```
This particular example can be used with the following command:

```
./e2e.sh --scenario precise-prefix-cache-aware --experiments precise-prefix-cache-aware
```