This repository uses setup.py to build the package. To develop, install the package in editable mode:
pip install -e .
Python setup.py develop vs install - StackOverflow
We need scikit-sparse in some of the data-exploration and benchmark code. It requires an extra step beyond pip install (it builds against the system SuiteSparse/CHOLMOD libraries); check its repo page for details.
Binding Disambiguation - pybind11 Documentation
Adding Lambda Function as a Class Method - pybind11 Documentation
This repository follows the directory structure of TransformerEngine - Github.
yqhu/profiler-workshop - Github provides examples of using the PyTorch profiler with Hugging Face models.
- hf_pipeline_prof.py demonstrates how to export the profiling results as JSON traces and FlameGraphs.
- hf_training_trainer_prof.py demonstrates how to profile a Hugging Face model by registering a TrainerCallback.
- hf_training_torch_prof.py demonstrates how to run the Hugging Face model step by step and profile it natively with the PyTorch profiler (see the sketch after this list).
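For reference, a minimal sketch of that native stepping pattern; `model`, `dataloader`, and `optimizer` are placeholders for whatever Hugging Face setup is being profiled, and the output file names are arbitrary:

import torch
from torch.profiler import profile, record_function, ProfilerActivity

# `model`, `dataloader`, and `optimizer` are assumed to exist
# (e.g. a Hugging Face model whose output object has a .loss field).
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    with_stack=True,  # stack capture is what enables the FlameGraph export below
) as prof:
    for step, batch in enumerate(dataloader):
        if step >= 5:  # profile only a handful of iterations
            break
        with record_function("train_step"):
            loss = model(**batch).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

prof.export_chrome_trace("trace.json")  # JSON trace, viewable in chrome://tracing / Perfetto
prof.export_stacks("cuda_stacks.txt", "self_cuda_time_total")  # input for flamegraph.pl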
Consider using the following Nsight Compute (ncu) flags to obtain profiling results in which inter-kernel interference is recorded.
--cache-control none --replay-mode application
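For example (the report name and target binary below are placeholders):

ncu --cache-control none --replay-mode application -o profile_report ./benchmark_binary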
We have incorporated stream switch support in our custom SparTa repo.
For CUTLASS, the Python interface has stream support internally, but it was not exposed in the top-level API; we filed a PR, now merged, that exposes stream support at the top level.
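On the PyTorch side, switching streams and obtaining the raw handle to pass down looks like the sketch below; how that handle is consumed by SparTa or by the CUTLASS top-level API follows the respective code and the merged PR, and is not shown here.

import torch

side_stream = torch.cuda.Stream()  # a non-default CUDA stream

with torch.cuda.stream(side_stream):
    # PyTorch ops launched in this context are enqueued on side_stream.
    x = torch.randn(1024, 1024, device="cuda")
    y = x @ x

# Raw CUstream handle (an integer) for APIs that accept an explicit stream argument.
raw_handle = side_stream.cuda_stream

# Have the default stream wait for side_stream before consuming its results.
torch.cuda.current_stream().wait_stream(side_stream)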
We don't preserve cuBLAS determinism for now.
cuBLAS has a determinism issue when a handle is shared across streams: reproducibility requires either 1) one handle per stream or 2) one workspace per stream. In CuPy, no workspace-setting API is exposed, and each device gets a single default handle. We also need to check whether PyTorch needs any additional handling to guarantee determinism.
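For the PyTorch side, the standard run-to-run reproducibility knobs are the cuBLAS workspace environment variable plus deterministic-algorithms mode; a minimal sketch (not yet wired into this repo, and it does not by itself answer the one-handle/one-workspace-per-stream question above):

import os
# Must be set before the first cuBLAS call; ":4096:8" is faster, ":16:8" uses less memory.
os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")

import torch
# Raises on nondeterministic ops and, together with the workspace config above,
# makes cuBLAS GEMM results reproducible across runs.
torch.use_deterministic_algorithms(True)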
There is no reported determinism issue for cuSPARSE; our guess is that it is deterministic because each SpMM operation has its own bufferSize query and explicitly allocated workspace.
Shared memory carveout configurations applied by the driver are more coarse-grained than the values set in the occupancy calculation. Since A100 supports 0/16KB/32KB/64KB/100KB/132KB/164KB, we may want to set the carveout to 100% for the best performance.
To achieve this, append the following code to the GemmRTBase.initialize() definition in python/cutlass/backend/gemm_operation.py in NVIDIA/cutlass - Github:
Documentation explaining CU_FUNC_ATTRIBUTE_PREFERRED_SHARED_MEMORY_CARVEOUT: "On devices where the L1 cache and shared memory use the same hardware resources, this sets the shared memory carveout preference, in percent of the total shared memory."
# `cuda` is the cuda-python driver binding already used in gemm_operation.py.
# Request a 100% shared memory carveout instead of the default driver heuristic.
err, = cuda.cuFuncSetAttribute(
    self.kernel,
    attrib=cuda.CUfunction_attribute.CU_FUNC_ATTRIBUTE_PREFERRED_SHARED_MEMORY_CARVEOUT,
    value=100)
if err != cuda.CUresult.CUDA_SUCCESS:
    raise RuntimeError(
        f"CUDA error on call to cuFuncSetAttribute: {cuda.cuGetErrorString(err)[1]}"
    )
# Set the function cache preference to favor shared memory over L1.
err, = cuda.cuFuncSetCacheConfig(
    self.kernel, cuda.CUfunc_cache.CU_FUNC_CACHE_PREFER_SHARED)
if err != cuda.CUresult.CUDA_SUCCESS:
    raise RuntimeError(
        f"CUDA error on call to cuFuncSetCacheConfig: {cuda.cuGetErrorString(err)[1]}"
    )
References:
- Is a pool of cuBLAS handles required for stream parallelism? (#4676)
- cuBLAS reproducibility documentation
- cuSPARSE documentation
We use the Accel-Sim microbenchmark suite, which is based on "Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking".