Although algorithm (static) class templates should not care about where computation is performed (CPU or GPU), I think a few design choices motivate parameterizing the algorithm itself rather than the `matrix` class. There are still reasons for parameterizing the `matrix` class, though, one being that polymorphic data containers (`kokkos`, for example) do the same thing, and such data types should be able to plug into distributed-memory algorithms without any pain. Think about the pros and cons here.

Three policy classes for offloading (just `gemm` for now); a minimal sketch of what they might look like follows the task list below:

1. `NoOffload` (default)
2. `OffloadKeepDataResident` (keep data resident on the GPU as much as possible. Communicating data that lives on the GPU is not a problem in itself, but remember that it still must cross the PCI bus; exploit pinned memory via a buffer allocated once at the beginning of the program and reused repeatedly)
3. `OffloadTransferData` (make no attempt to keep data resident on the GPU; transfer operands for each `gemm` invocation. Mainly a sanity-check policy class)

* Don't forget to incorporate the policy into the `validate` class templates as well.
* Modify all test.cpp files to initially allocate memory on the device.
* Modify all Makefiles to use the nvcc compiler and the corresponding flags. Note that anything compiled with `nvcc` must be kept separate from anything compiled with MPI.
* Update the `blas` directory to allow `cuBlas` headers.
* Update bench/ files to include the Offload policy class.
* Replace any `syrk` or `trmm` calls with `gemm`? Will that interfere with algorithm-specific policies (via non-orthogonal policy classes)?
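
Below is a minimal, hypothetical sketch of the three policy classes and of an algorithm routine parameterized on them. It assumes column-major `double` buffers, a CBLAS call for the host path, and cuBLAS for the device path; the function name `local_multiply` and the exact signatures are illustrative, not this library's actual interfaces, and `OffloadKeepDataResident` is only outlined in a comment. Error checking is omitted for brevity.

```cpp
// Sketch of offload policy classes for gemm (assumptions noted above).
#include <cblas.h>        // host BLAS path
#include <cublas_v2.h>    // device BLAS path
#include <cuda_runtime.h>

// Policy 1: NoOffload (default) -- run gemm entirely on the CPU.
struct NoOffload {
  static void gemm(int m, int n, int k, double alpha,
                   const double* A, int lda, const double* B, int ldb,
                   double beta, double* C, int ldc) {
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, alpha, A, lda, B, ldb, beta, C, ldc);
  }
};

// Policy 3: OffloadTransferData -- copy operands to the GPU, run cuBLAS,
// copy the result back, on every invocation. Sanity-check policy only.
struct OffloadTransferData {
  static void gemm(int m, int n, int k, double alpha,
                   const double* A, int lda, const double* B, int ldb,
                   double beta, double* C, int ldc) {
    cublasHandle_t handle;
    cublasCreate(&handle);
    double *dA, *dB, *dC;
    cudaMalloc((void**)&dA, sizeof(double) * lda * k);  // A is m x k, lda >= m
    cudaMalloc((void**)&dB, sizeof(double) * ldb * n);  // B is k x n, ldb >= k
    cudaMalloc((void**)&dC, sizeof(double) * ldc * n);  // C is m x n, ldc >= m
    cudaMemcpy(dA, A, sizeof(double) * lda * k, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B, sizeof(double) * ldb * n, cudaMemcpyHostToDevice);
    cudaMemcpy(dC, C, sizeof(double) * ldc * n, cudaMemcpyHostToDevice);
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                &alpha, dA, lda, dB, ldb, &beta, dC, ldc);
    cudaMemcpy(C, dC, sizeof(double) * ldc * n, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    cublasDestroy(handle);
  }
};

// Policy 2: OffloadKeepDataResident would instead hold the handle and
// pinned, pre-allocated device buffers across calls (allocated once at
// program start), copying only when data is not already resident.

// The algorithm's local multiply delegates to whichever policy it was
// instantiated with, so the distributed logic never mentions the GPU.
template <class OffloadPolicy = NoOffload>
void local_multiply(int m, int n, int k, double alpha,
                    const double* A, int lda, const double* B, int ldb,
                    double beta, double* C, int ldc) {
  OffloadPolicy::gemm(m, n, k, alpha, A, lda, B, ldb, beta, C, ldc);
}
```

A call site would then pick the policy at compile time, e.g. `local_multiply<OffloadTransferData>(...)`, which is the same mechanism the `validate` and bench/ templates would use to accept the policy as an extra template parameter.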