This repository uses setup.py to build the package. To develop, install the package in editable mode:
pip install -e .
Python setup.py develop vs install - StackOverflow
We need scikit-sparse in some of the data-exploration and benchmark code. It requires an extra step beyond pip install (it builds against the system SuiteSparse/CHOLMOD libraries); check its repo page for details.
Binding Disambiguation - pybind11 Documentation
Adding Lambda Function as a Class Method - pybind11 Documentation
This repository follows the directory structure of TransformerEngine - Github.
yqhu/profiler-workshop - Github provides examples of using the PyTorch profiler with Hugging Face models.
- hf_pipeline_prof.py demonstrates how to export the profiling results as JSON traces and FlameGraphs.
- hf_training_trainer_prof.py demonstrates how to profile a Hugging Face model by registering a TrainerCallback.
- hf_training_torch_prof.py demonstrates how to run the Hugging Face model step by step and profile it natively with the PyTorch profiler (see the sketch after this list).
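For reference, a minimal sketch of that native stepping pattern; `model`, `dataloader`, and `optimizer` are placeholders for whatever Hugging Face setup is being profiled, and the output file names are arbitrary:

import torch
from torch.profiler import profile, record_function, ProfilerActivity

# `model`, `dataloader`, and `optimizer` are assumed to exist
# (e.g. a Hugging Face model whose output object has a .loss field).
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    with_stack=True,  # stack capture is what enables the FlameGraph export below
) as prof:
    for step, batch in enumerate(dataloader):
        if step >= 5:  # profile only a handful of iterations
            break
        with record_function("train_step"):
            loss = model(**batch).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

prof.export_chrome_trace("trace.json")  # JSON trace, viewable in chrome://tracing / Perfetto
prof.export_stacks("cuda_stacks.txt", "self_cuda_time_total")  # input for flamegraph.pl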
Consider using the following Nsight Compute (ncu) flags to obtain profiling results in which inter-kernel interference is recorded.
--cache-control none --replay-mode application
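For example (the report name and target binary below are placeholders):

ncu --cache-control none --replay-mode application -o profile_report ./benchmark_binary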
We have incorporated stream switch support in our custom SparTa repo.
For CUTLASS, the Python interface has stream support internally, but it was not exposed in the top-level API; we filed a PR, now merged, that exposes stream support at the top level.
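On the PyTorch side, switching streams and obtaining the raw handle to pass down looks like the sketch below; how that handle is consumed by SparTa or by the CUTLASS top-level API follows the respective code and the merged PR, and is not shown here.

import torch

side_stream = torch.cuda.Stream()  # a non-default CUDA stream

with torch.cuda.stream(side_stream):
    # PyTorch ops launched in this context are enqueued on side_stream.
    x = torch.randn(1024, 1024, device="cuda")
    y = x @ x

# Raw CUstream handle (an integer) for APIs that accept an explicit stream argument.
raw_handle = side_stream.cuda_stream

# Have the default stream wait for side_stream before consuming its results.
torch.cuda.current_stream().wait_stream(side_stream)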
We don't preserve cuBLAS determinism for now.
cuBLAS has a determinism issue when a handle is shared across streams: reproducibility requires either 1) one handle per stream or 2) one workspace per stream. In CuPy, no workspace-setting API is exposed, and each device gets a single default handle. We also need to check whether PyTorch needs any additional handling to guarantee determinism.
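For the PyTorch side, the standard run-to-run reproducibility knobs are the cuBLAS workspace environment variable plus deterministic-algorithms mode; a minimal sketch (not yet wired into this repo, and it does not by itself answer the one-handle/one-workspace-per-stream question above):

import os
# Must be set before the first cuBLAS call; ":4096:8" is faster, ":16:8" uses less memory.
os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")

import torch
# Raises on nondeterministic ops and, together with the workspace config above,
# makes cuBLAS GEMM results reproducible across runs.
torch.use_deterministic_algorithms(True)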
There is no reported determinism issue for cuSPARSE; our guess is that it is deterministic because each SpMM operation has its own bufferSize query and explicitly allocated workspace.
Shared memory carveout configurations applied by the driver are more coarse-grained than the values set in the occupancy calculation. Since A100 supports 0/16KB/32KB/64KB/100KB/132KB/164KB, we may want to set the carveout to 100% for the best performance.
To achieve this, append the following code to the GemmRTBase.initialize() definition in python/cutlass/backend/gemm_operation.py in NVIDIA/cutlass - Github:
Documentation explaining CU_FUNC_ATTRIBUTE_PREFERRED_SHARED_MEMORY_CARVEOUT: "On devices where the L1 cache and shared memory use the same hardware resources, this sets the shared memory carveout preference, in percent of the total shared memory."
# `cuda` is the cuda-python driver binding already used in gemm_operation.py.
# Request a 100% shared memory carveout instead of the default driver heuristic.
err, = cuda.cuFuncSetAttribute(
    self.kernel,
    attrib=cuda.CUfunction_attribute.CU_FUNC_ATTRIBUTE_PREFERRED_SHARED_MEMORY_CARVEOUT,
    value=100)
if err != cuda.CUresult.CUDA_SUCCESS:
    raise RuntimeError(
        f"CUDA error on call to cuFuncSetAttribute: {cuda.cuGetErrorString(err)[1]}"
    )
# Set the function cache preference to favor shared memory over L1.
err, = cuda.cuFuncSetCacheConfig(
    self.kernel, cuda.CUfunc_cache.CU_FUNC_CACHE_PREFER_SHARED)
if err != cuda.CUresult.CUDA_SUCCESS:
    raise RuntimeError(
        f"CUDA error on call to cuFuncSetCacheConfig: {cuda.cuGetErrorString(err)[1]}"
    )
References:
- Is a pool of cuBLAS handles required for stream parallelism? (#4676)
- cuBLAS reproducibility documentation
- cuSPARSE documentation
We use the Accel-Sim microbenchmark suite, which is based on "Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking".