Intra-Streaming-Multiprocessor (IntraSM) Engine

Contributing

Setuptools Development Mode

This repository uses setup.py to build the package. To develop, install the package in editable mode:

pip install -e .

Python setup.py develop vs install - StackOverflow

Installation

We need scikit-sparse in some data-exploration and benchmark code. It requires additional steps beyond a plain pip install; check its repository page for details.
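
As a quick sanity check after installation, a small CHOLMOD factorization can be run through scikit-sparse. This is only an illustrative sketch; the matrix and the choice of CHOLMOD are arbitrary and not necessarily what the benchmark code exercises.

import numpy as np
import scipy.sparse as sp
from sksparse.cholmod import cholesky

# Small symmetric positive-definite matrix in CSC format (illustrative values).
A = sp.csc_matrix(np.array([[4.0, 1.0, 0.0],
                            [1.0, 3.0, 0.0],
                            [0.0, 0.0, 2.0]]))
factor = cholesky(A)      # CHOLMOD Cholesky factorization
x = factor(np.ones(3))    # solve A x = b for b = [1, 1, 1]
print(x)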

Pybind Overloading

Binding Disambiguation - pybind11 Documentation

Adding Lambda Function as a Class Method - pybind11 Documentation

Pybind Resource Management

Smart Pointers

Directory Structure

This repository follows the directory structure of TransformerEngine - Github.

Profiling

yqhu/profiler-workshop - Github provides examples of using the PyTorch profiler to profile Hugging Face models:

  1. hf_pipeline_prof.py demonstrates how to export the profiling results as JSON traces and FlameGraphs.
  2. hf_training_trainer_prof.py demonstrates how to profile a Hugging Face model by registering a TrainerCallback.
  3. hf_training_torch_prof.py demonstrates how to run a Hugging Face model step by step and profile it natively with the PyTorch profiler; a minimal sketch of this flow follows the list.
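
For reference, a minimal sketch of the native PyTorch profiler flow used in the third example, with a stand-in model and input instead of a Hugging Face pipeline:

import torch
from torch.profiler import profile, ProfilerActivity

# Stand-ins for the Hugging Face model and batch under study.
model = torch.nn.Linear(1024, 1024).cuda()
inputs = torch.randn(64, 1024, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             record_shapes=True) as prof:
    for _ in range(5):
        model(inputs)
torch.cuda.synchronize()

# Export a Chrome/Perfetto-compatible JSON trace and print a per-op summary.
prof.export_chrome_trace("trace.json")
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))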

Nsight Compute Flags

Consider using the following flags to obtain Nsight Compute profiling results in which inter-kernel interference is recorded:

--cache-control none --replay-mode application

Code Health Badges

CodeFactor, Codacy, and DeepSource.

Library Support

Multistream

We have incorporated stream switch support in our custom SparTa repo.

For CUTLASS, the Python interface has stream support internally, but it was not exposed in the top-level API; we filed a PR, now merged, that exposes it.
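
For context, the basic stream-switch pattern in plain PyTorch looks like the following sketch (independent of the SparTa and CUTLASS integrations; the workloads and shapes are illustrative):

import torch

# Two independent GEMMs that we would like to overlap (illustrative shapes).
a = torch.randn(2048, 2048, device="cuda")
b = torch.randn(2048, 2048, device="cuda")

s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()

# Make sure the inputs are ready before the side streams start using them.
s1.wait_stream(torch.cuda.current_stream())
s2.wait_stream(torch.cuda.current_stream())

# Issue each GEMM on its own stream so the kernels may run concurrently.
with torch.cuda.stream(s1):
    out1 = a @ b
with torch.cuda.stream(s2):
    out2 = b @ a

# Join both streams back to the default stream before consuming the results.
torch.cuda.current_stream().wait_stream(s1)
torch.cuda.current_stream().wait_stream(s2)
torch.cuda.synchronize()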

CUDA Library Determinism

We don't preserve cuBLAS determinism for now.

cuBLAS has a determinism issue: reproducibility requires either 1) one handle per stream or 2) one workspace per stream. CuPy exposes no workspace-setting API, and each device gets a single default handle. We also need to check whether any additional handling is necessary in PyTorch to guarantee determinism.
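
For reference, if we later decide to enforce reproducible cuBLAS behavior on the PyTorch side, the documented knobs are the CUBLAS_WORKSPACE_CONFIG environment variable plus PyTorch's deterministic-algorithms mode (a sketch; this does not by itself answer the per-stream handle/workspace question above):

import os

# Must be set before the first cuBLAS call; ":4096:8" and ":16:8" are the
# two configurations documented for reproducibility.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

import torch  # imported after the env var so cuBLAS picks it up

# Ask PyTorch to use (or insist on) deterministic kernels.
torch.use_deterministic_algorithms(True)

a = torch.randn(512, 512, device="cuda")
b = torch.randn(512, 512, device="cuda")
c = a @ b  # cuBLAS GEMM; repeated runs should now be bitwise-identical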

There is no reported issue about cuSPARSE determinism; presumably it is deterministic because each SpMM operation has its own dedicated bufferSize allocation step.

Setting CarveOut

The shared memory configurations supported by the driver are more coarse-grained than the values set in the occupancy calculation. Since the A100 supports 0/16KB/32KB/64KB/100KB/132KB/164KB carveouts, we may want to set the carveout to 100% for the best performance.

To achieve this, append the following code to the GemmRTBase.initialize() definition in python/cutlass/backend/gemm_operation.py in NVIDIA/cutlass - Github.

Documentation explaining CU_FUNC_ATTRIBUTE_PREFERRED_SHARED_MEMORY_CARVEOUT: "On devices where the L1 cache and shared memory use the same hardware resources, this sets the shared memory carveout preference, in percent of the total shared memory."

# Set the preferred shared memory carveout to 100% of the total shared memory
err, = cuda.cuFuncSetAttribute(
    self.kernel,
    attrib=cuda.CUfunction_attribute.CU_FUNC_ATTRIBUTE_PREFERRED_SHARED_MEMORY_CARVEOUT,
    value=100)
if err != cuda.CUresult.CUDA_SUCCESS:
    raise RuntimeError(
        f"CUDA error on call to cuFuncSetAttribute: {cuda.cuGetErrorString(err)[1]}"
    )
# Set function cache preference
err, = cuda.cuFuncSetCacheConfig(
    self.kernel, cuda.CUfunc_cache.CU_FUNC_CACHE_PREFER_SHARED)
if err != cuda.CUresult.CUDA_SUCCESS:
    raise RuntimeError(
        f"CUDA error on call to cuFuncSetCacheConfig: {cuda.cuGetErrorString(err)[1]}"
    )
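
To confirm that the preference took effect, the attribute can be read back right after it is set (a sketch using the same cuda-python bindings and the same self.kernel handle):

# Read the carveout attribute back; it should report the value just set.
err, carveout = cuda.cuFuncGetAttribute(
    cuda.CUfunction_attribute.CU_FUNC_ATTRIBUTE_PREFERRED_SHARED_MEMORY_CARVEOUT,
    self.kernel)
if err != cuda.CUresult.CUDA_SUCCESS:
    raise RuntimeError(
        f"CUDA error on call to cuFuncGetAttribute: {cuda.cuGetErrorString(err)[1]}"
    )
assert carveout == 100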

Reference

Is a pool of cuBLAS handles required for stream parallelism? #4676
cuBLAS reproducibility
cuSPARSE documentation

Auto-tuning

Microbenchmark

We use the Accel-Sim microbenchmark suite, which is based on "Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking".

Contact

Kun Wu, kunwu2 (at) illinois (dot) edu

About

Depends on PyTorch 2.3 for now; updating to 2.4.
