Although algorithm (static) class templates should not care about where computation is performed (CPU or GPU), I think a few design choices motivate parameterizing the algorithm itself rather than the `matrix` class. There are still reasons for parameterizing the `matrix` class, though, one being that polymorphic data containers (`kokkos`, for example) do the same thing, and such data types should be able to plug into distributed-memory algorithms without any pain. Think about the pros and cons here.

Three policy classes for offloading (just `gemm` for now); a minimal sketch of what they might look like follows the task list below:

1. `NoOffload` (default)
2. `OffloadKeepDataResident` (keep data resident on the GPU as much as possible. Communicating data that lives on the GPU is not a problem in itself, but remember that it still must cross the PCI bus; exploit pinned memory via a buffer allocated once at the beginning of the program and reused repeatedly)
3. `OffloadTransferData` (make no attempt to keep data resident on the GPU; transfer operands for each `gemm` invocation. Mainly a sanity-check policy class)

* Don't forget to incorporate the policy into the `validate` class templates as well.
* Modify all test.cpp files to initially allocate memory on the device.
* Modify all Makefiles to use the nvcc compiler and the corresponding flags. Note that anything compiled with `nvcc` must be kept separate from anything compiled with MPI.
* Update the `blas` directory to allow `cuBlas` headers.
* Update bench/ files to include the Offload policy class.
* Replace any `syrk` or `trmm` calls with `gemm`? Will that interfere with algorithm-specific policies (via non-orthogonal policy classes)?
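
Below is a minimal, hypothetical sketch of the three policy classes and of an algorithm routine parameterized on them. It assumes column-major `double` buffers, a CBLAS call for the host path, and cuBLAS for the device path; the function name `local_multiply` and the exact signatures are illustrative, not this library's actual interfaces, and `OffloadKeepDataResident` is only outlined in a comment. Error checking is omitted for brevity.

```cpp
// Sketch of offload policy classes for gemm (assumptions noted above).
#include <cblas.h>        // host BLAS path
#include <cublas_v2.h>    // device BLAS path
#include <cuda_runtime.h>

// Policy 1: NoOffload (default) -- run gemm entirely on the CPU.
struct NoOffload {
  static void gemm(int m, int n, int k, double alpha,
                   const double* A, int lda, const double* B, int ldb,
                   double beta, double* C, int ldc) {
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, alpha, A, lda, B, ldb, beta, C, ldc);
  }
};

// Policy 3: OffloadTransferData -- copy operands to the GPU, run cuBLAS,
// copy the result back, on every invocation. Sanity-check policy only.
struct OffloadTransferData {
  static void gemm(int m, int n, int k, double alpha,
                   const double* A, int lda, const double* B, int ldb,
                   double beta, double* C, int ldc) {
    cublasHandle_t handle;
    cublasCreate(&handle);
    double *dA, *dB, *dC;
    cudaMalloc((void**)&dA, sizeof(double) * lda * k);  // A is m x k, lda >= m
    cudaMalloc((void**)&dB, sizeof(double) * ldb * n);  // B is k x n, ldb >= k
    cudaMalloc((void**)&dC, sizeof(double) * ldc * n);  // C is m x n, ldc >= m
    cudaMemcpy(dA, A, sizeof(double) * lda * k, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B, sizeof(double) * ldb * n, cudaMemcpyHostToDevice);
    cudaMemcpy(dC, C, sizeof(double) * ldc * n, cudaMemcpyHostToDevice);
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                &alpha, dA, lda, dB, ldb, &beta, dC, ldc);
    cudaMemcpy(C, dC, sizeof(double) * ldc * n, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    cublasDestroy(handle);
  }
};

// Policy 2: OffloadKeepDataResident would instead hold the handle and
// pinned, pre-allocated device buffers across calls (allocated once at
// program start), copying only when data is not already resident.

// The algorithm's local multiply delegates to whichever policy it was
// instantiated with, so the distributed logic never mentions the GPU.
template <class OffloadPolicy = NoOffload>
void local_multiply(int m, int n, int k, double alpha,
                    const double* A, int lda, const double* B, int ldb,
                    double beta, double* C, int ldc) {
  OffloadPolicy::gemm(m, n, k, alpha, A, lda, B, ldb, beta, C, ldc);
}
```

A call site would then pick the policy at compile time, e.g. `local_multiply<OffloadTransferData>(...)`, which is the same mechanism the `validate` and bench/ templates would use to accept the policy as an extra template parameter.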