Hierarchical Inter-Operator Scheduler (HIOS)

To accelerate CNN inference on large image inputs, a common strategy is to reduce the image size, which lowers accuracy. However, many applications, such as remote sensing and biomedical analysis, where high-resolution images are common, require both high accuracy and low latency.

Inter-operator parallelism is an effective way to lower the inference latency of a directed acyclic graph (DAG) structured DNN model when the input image is small. However, because the computational workload of a CNN operator on a large input image is very high, it is hard to parallelize operators within a single GPU. In our experiments, we found that for large input images, running two identical convolution operators in parallel on one GPU can even increase latency compared with executing them sequentially.
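
As an illustration of the kind of single-GPU experiment described above, a minimal sketch that launches two identical convolutions on separate CUDA streams might look like the following. It assumes PyTorch purely for illustration; it is not the repository's execution engine.

import torch

# Hypothetical illustration: two identical convolutions on one GPU,
# issued on separate CUDA streams so they may overlap.
x = torch.randn(1, 32, 1024, 1024, device="cuda")
conv1 = torch.nn.Conv2d(32, 32, 3, padding=1).cuda()
conv2 = torch.nn.Conv2d(32, 32, 3, padding=1).cuda()

s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()
torch.cuda.synchronize()
with torch.cuda.stream(s1):
    y1 = conv1(x)
with torch.cuda.stream(s2):
    y2 = conv2(x)
torch.cuda.synchronize()
# For large inputs, each convolution already saturates the GPU, so this
# overlap can be slower than running conv1 and conv2 back to back.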

In our work HIOS, we study inter-operator parallelism beyond a single GPU to accelerate DAG-structured models on high-resolution input images. The main bottleneck of inter-operator parallelism across multiple GPUs is communication latency. With high-speed interconnects such as NVLink, however, communication latency is much lower than it used to be. This opens new opportunities for inter-operator parallelism on multiple GPUs connected by NVLink, and we design the Hierarchical Inter-Operator Scheduler (HIOS) to automatically schedule the parallel execution of operators on such a platform.

Scheduling Challenges:

  • Spatially, mapping operators onto GPUs
    • Allocate independent operators to different GPUs to improve the degree of parallelism
    • Assign dependent operators to the same GPU to minimize data transfer time
  • Temporally, assigning operators to streams
    • Avoid hardware under-utilization and resource contention within a single GPU
    • Maintain dependencies between operators on different GPUs

The interaction between spatial and temporal decisions makes the joint two-dimensional optimization problem intractable.
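
To make this decision space concrete, a schedule assigns each operator both a GPU and a stream on that GPU. A minimal, hypothetical sketch of such a representation (the names Placement and schedule are illustrative, not part of the HIOS code) might be:

from dataclasses import dataclass

@dataclass
class Placement:
    gpu: int     # spatial decision: which GPU runs the operator
    stream: int  # temporal decision: which stream on that GPU

# Hypothetical schedule for a tiny DAG a -> {b, c} -> d.
schedule = {
    "a": Placement(gpu=0, stream=0),
    "b": Placement(gpu=0, stream=0),  # depends on a: same GPU avoids a transfer
    "c": Placement(gpu=1, stream=0),  # independent of b: another GPU adds parallelism
    "d": Placement(gpu=0, stream=0),  # needs c's output: incurs an NVLink transfer
}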

Methodology:

HIOS works in two steps (a conceptual sketch of the longest-path idea follows the list):

  1. Inter-GPU inter-operator parallelization
    • Longest-path-based operator scheduling
    • In this step, operators are mapped to GPUs
  2. Intra-GPU inter-operator parallelization
    • Starts after operators have been mapped onto GPUs in the previous step
    • In this step, operators are mapped to streams within a GPU
    • Using a sliding window, it finds operators to parallelize across streams within a GPU

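The per-operator latencies used for scheduling come from profiling (see the system components below). A minimal sketch of the critical-path computation that longest-path-based scheduling builds on, assuming a DAG given as adjacency lists and per-operator latencies (the actual HIOS scheduler is more involved), might look like:

from collections import defaultdict
from graphlib import TopologicalSorter

def longest_path(succ, latency):
    """Return the length of the longest (critical) path and each operator's
    distance, where edges are given by `succ` and node weights by `latency`
    (seconds from profiling). Illustrative only."""
    pred = defaultdict(list)
    for u, vs in succ.items():
        for v in vs:
            pred[v].append(u)
    dist = {}
    # Visit operators in topological order so predecessors are finished first.
    for u in TopologicalSorter(pred).static_order():
        dist[u] = latency[u] + max((dist[p] for p in pred[u]), default=0.0)
    return max(dist.values()), dist

# Tiny example DAG: a -> b -> d and a -> c -> d.
succ = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
latency = {"a": 1.0, "b": 3.0, "c": 2.0, "d": 1.0}
print(longest_path(succ, latency))  # critical path a -> b -> d, length 5.0
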
[Figure: overview of the two-step HIOS scheduling approach]

Implementation

Please follow this section to run HIOS on multiple GPUs connected via NVLink. We conducted our experiments on two GPUs connected via NVLink in a single server. This code base has not been tested with more than two GPUs because we did not have access to more GPUs connected via NVLink.

Environment Prerequisites

HIOS System Components

There are two major components of the HIOS system:

  1. Python-based scheduler
    • Takes as input a computation graph defined in Python
    • Parses the graph into nodes and edges
    • Profiles each node, edge, and subgraph using the execution engine (the other component of HIOS)
    • Using the profiling information, the HIOS algorithm generates a schedule
    • The schedules generated by HIOS are in JSON format
  2. Execution engine, a CUDA-aware MPI application
    • Takes as input schedules of nodes, edges, and graphs in JSON format
    • Executes the schedules and returns the measured latency (a sketch of the measurement pattern follows the list)
    • Before measuring latency, it performs warm-up executions of the schedules multiple times
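
The executor itself is a CUDA-aware MPI application, but the warm-up/measure pattern it follows is the usual one; the warmup/number/repeat knobs also appear in the optimize call shown later. A minimal Python sketch of that pattern, where run_schedule is a hypothetical stand-in for one execution of a schedule, might be:

import time

def measure_latency(run_schedule, warmup=2, number=6, repeat=6):
    """Warm up, then time `number` executions `repeat` times.
    `run_schedule` is a hypothetical callable that executes a schedule once;
    the real engine does this inside its CUDA-aware MPI processes."""
    for _ in range(warmup):           # warm-up executions are not measured
        run_schedule()
    results = []
    for _ in range(repeat):
        start = time.perf_counter()
        for _ in range(number):
            run_schedule()
        results.append((time.perf_counter() - start) / number)
    return results                    # per-repeat average latencies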

Instructions to Run:

Follow these instructions to profile the Inception_v3, NASNet, and Randwire computation graphs:

  1. Download the source code from GitHub
  2. Go to the executor folder and set the CUDA and CUPTI library paths in buildfile.sh
  3. Run sh buildfile.sh in a terminal. This builds the executable of the execution engine (a CUDA-aware MPI application)
  4. To measure latency for Inception_v3, NASNet, and Randwire, run sh run_expr_batchsize.sh in the parent folder
  5. The input image size and batch size are configurable in run_expr_batchsize.sh
  6. The schedule and optimization cost will be written to the output folder

For a Custom Computation Graph

  1. Define your computation graph in the main.py file in the parent folder, as in the following example
  2. Build the execution engine
  3. Run python main.py in the parent folder
  4. The schedule and optimization cost will be written to the output folder
import sys
import time

# placeholder, Block, conv2d, identity, Graph, optimize, and dump_results are
# provided by the HIOS scheduler modules in this repository; keep the imports
# that main.py already contains.


def sample_network():
    # A small two-block DAG used as a demo input to the scheduler.
    v = placeholder(output_shape=(1, 500, 500))
    block = Block(enter_node=v.node)
    v1 = conv2d(block, inputs=[[v]], out_channels=1, kernel=(3, 3), stride=(1, 1), padding=(1, 1), act='relu')
    v2 = conv2d(block, inputs=[[v]], out_channels=1, kernel=(3, 3), stride=(1, 1), padding=(1, 1), act='relu')
    v3 = conv2d(block, inputs=[[v]], out_channels=1, kernel=(3, 3), stride=(1, 1), padding=(1, 1), act='relu')
    v1 = conv2d(block, inputs=[[v1]], out_channels=1, kernel=(3, 3), stride=(1, 1), padding=(1, 1), act='relu')
    out = identity(block, inputs=[[v1, v2], [v3]], is_exit=True)  # reduce (add) v1 and v2, then concat v3
    block1 = Block(enter_node=out.node)
    v11 = conv2d(block1, inputs=[[out]], out_channels=1, kernel=(3, 3), stride=(1, 1), padding=(1, 1), act='relu')
    v21 = conv2d(block1, inputs=[[out]], out_channels=1, kernel=(3, 3), stride=(1, 1), padding=(1, 1), act='relu')
    v31 = conv2d(block1, inputs=[[out]], out_channels=1, kernel=(3, 3), stride=(1, 1), padding=(1, 1), act='relu')
    v11 = conv2d(block1, inputs=[[v11]], out_channels=1, kernel=(3, 3), stride=(1, 1), padding=(1, 1), act='relu')
    out1 = identity(block1, inputs=[[v11, v21, v31]], is_exit=True)  # reduce (add) v11, v21, and v31
    graph = Graph(name="demo", input=v.node, blocks=[block, block1])

    return graph


def main():
    graph = sample_network()
    opt_type = "hios_lp"  # scheduling algorithm to use
    batch_size = 1
    height = 500
    width = 500
    ngpu = 2
    device = "v100"

    t1 = time.time()
    sys.setrecursionlimit(7000)  # the scheduler recurses deeply on large graphs
    optimize(height, width, graph, opt_type, batch_size=batch_size, warmup=2, number=6, repeat=6, ngpu=ngpu, device=device)
    t2 = time.time()
    optimization_cost = t2 - t1
    print("Optimization cost:")
    print(optimization_cost)

    dump_results(graph.name, height, width, optimization_cost, opt_type, batch_size=1, warmup=2, number=2, repeat=6, ngpu=ngpu, device=device)


if __name__ == "__main__":
    main()

Results

The following figure shows the inference latency of the Inception v3 and NASNet networks for different image sizes.

[Figure: inference latency of Inception v3 and NASNet for different image sizes]



The following figure shows the performance gain for small and large image inputs.

[Figure: performance gain for small and large image inputs]



The following figure shows the optimization cost for the Inception v3 and NASNet networks.

[Figure: optimization cost for Inception v3 and NASNet]

Citation

If you find that HIOS helps your research, please consider citing it:


@inproceedings{Kundu:HierInteOperSchedRTInferenceDAGDLMdlMultGPU:Cluster23,
  author =    {Turja Kundu and Tong Shu},
  title =     {HIOS: Hierarchical Inter-Operator Scheduler for Real-Time Inference of DAG-Structured Deep Learning Models on Multiple GPUs},
  booktitle = {Proc. of the 25th IEEE International Conference on Cluster Computing},
  series =    {Cluster},
  year =      {2023},
  address =   {Santa Fe, NM, USA},
  month =     {Nov},
  numpages =  {12},
}
