
Commit 94dd14b

Add the project README and improve examples

Closes #9

1 parent cf4c204 commit 94dd14b

File tree

5 files changed: +335 -38 lines changed


README.md

Lines changed: 304 additions & 1 deletion
@@ -1 +1,304 @@
-# GPU-memory-access-visualization

<a name="readme-top"></a>

[![Contributors][contributors-shield]][contributors-url]
[![Forks][forks-shield]][forks-url]
[![Stargazers][stars-shield]][stars-url]
[![Issues][issues-shield]][issues-url]
[![MPL-2.0 License][license-shield]][license-url]

<br />
<div align="center">

<h3 align="center">GPU-memory-access-visualization</h3>

  <p align="center">
    A single-header CUDA library that allows logging individual memory accesses of a GPU kernel with as few changes to the code as possible. An export to JSON together with a web-based visualization is included and allows for easy analysis of the memory access patterns.
    <br />
    <a href="https://github.com/NetroScript/GPU-memory-access-visualization/tree/master/html"><b>View Documentation for the visualization</b></a>
    ·
    <a href="https://github.com/NetroScript/GPU-memory-access-visualization/issues">Report Bug</a>
    ·
    <a href="https://github.com/NetroScript/GPU-memory-access-visualization/issues">Request Feature</a>
  </p>
</div>

<!-- TOC -->
* [About The Project](#about-the-project)
* [Usage](#usage)
  * [Simple Example](#simple-example)
    * [Example Kernel](#example-kernel)
    * [Creating an instance to store the memory](#creating-an-instance-to-store-the-memory)
    * [Wrapping the data arrays](#wrapping-the-data-arrays)
    * [Changing the kernel](#changing-the-kernel)
    * [Getting the data](#getting-the-data)
    * [Full example](#full-example)
  * [Gotchas](#gotchas)
    * [Passing a `CudaMemAccessLogger` pointer to a kernel](#passing-a-cudamemaccesslogger-pointer-to-a-kernel)
    * [Doing operations besides just assignment](#doing-operations-besides-just-assignment)
    * [Synchronizing the device](#synchronizing-the-device)
* [Contributing](#contributing)
* [License](#license)
<!-- TOC -->

## About The Project

![Application Preview](https://user-images.githubusercontent.com/18115780/218279005-7b91f1ed-f029-4e75-90d8-c6d1c5dcc3fc.png)

This repository contains a single-header CUDA library that allows logging individual memory accesses of a GPU kernel with as few changes to the code as possible. **Internally the library uses CUDA Unified Memory to store the memory access information. Because of this, please make sure your targeted architecture supports this feature.** The library was tested on a GeForce RTX 2070 Ti and on a GTX 1060.

The overall design goal was to require as few changes to the code as possible. The concrete usage is shown in the [Usage section](#usage).

The library takes care of storing all the memory accesses using the provided data structure (which behaves almost exactly like a normal array). Besides that, the library provides functionality to store this data to the filesystem. By default, an extremely basic HTML output is provided together with a JSON output. The JSON output can be used to create a custom visualization; this repository already includes one application which can visualize this data in a browser. You can find that application in the [`html` folder](https://github.com/NetroScript/GPU-memory-access-visualization/tree/master/html) together with documentation on how to use it. For easier usage, the releases section already contains a pre-built version of the application by default.

Should these data formats not be sufficient for your needs, you can easily pass in a custom callback function to the library. This callback function will be called for every memory access and can be used to store the data in any format you like. To get an idea of how to use it (as it is not documented), you can take a look inside the `generateTemplatedOutput` function.

## Usage

This section shows the most basic usage of the library with CUDA code. If you instead want to know how to use the visualization, please take a look at the [documentation here](https://github.com/NetroScript/GPU-memory-access-visualization/tree/master/html).

For complete working examples you can also take a look at the `examples` folder. This folder contains multiple simple CUDA applications that already use the library (files ending with `_original` are the CUDA files before they were adapted to use the library).

To use the library, just include the header file `cuda_mav.cuh` in your project. The library is a single header file and does not require any additional files to be included.
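
For example, the top of your `.cu` file could look like this (the `<memory>` include is only needed because the examples below manage the storage with `std::make_unique`):

```cpp
#include <memory>        // for std::make_unique, used in the examples below
#include "cuda_mav.cuh"  // the single-header library itself
```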

### Simple Example

#### Example Kernel

Let's assume we have the following code right now:

```cpp
__global__ void kernel(int* data, int* result) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    result[idx] = data[idx] * 2;
}

int main() {
    // Data on the host
    int* h_data = new int[100];
    int* h_result = new int[100];

    // Data on the device
    int* d_data;
    int* d_result;

    // Allocate memory on the device
    cudaMalloc(&d_data, 100 * sizeof(int));
    cudaMalloc(&d_result, 100 * sizeof(int));

    // Copy data to the device
    cudaMemcpy(d_data, h_data, 100 * sizeof(int), cudaMemcpyHostToDevice);

    // Execute the kernel
    kernel<<<10, 10>>>(d_data, d_result);

    // Copy data back to the host
    cudaMemcpy(h_result, d_result, 100 * sizeof(int), cudaMemcpyDeviceToHost);

    // Free the device memory
    cudaFree(d_data);
    cudaFree(d_result);

    // Free the host memory
    delete[] h_data;
    delete[] h_result;
}
```

We can leave the old code almost entirely intact and only need to slightly change the kernel and the kernel call.

#### Creating an instance to store the memory

The first thing you need to do is create a new instance of a `CudaMemAccessStorage`. This class stores all the memory access information, and you need to pass in the expected total number of memory accesses. This is mainly limited by your available host memory, as internally the library uses CUDA Unified Memory to store the data.

```cpp
// We are using auto with make_unique here to automatically free the memory when the scope ends
// You can also use a normal pointer and free it manually
auto memAccessStorage = std::make_unique<CudaMemAccessStorage<int>>(100*2);
```

The capacity of `100*2` comes from the kernel itself: each of the 100 threads performs one read of `data` and one write of `result`, so 200 accesses are recorded in total.

#### Wrapping the data arrays

After that we need to wrap our data arrays in our custom class `CudaMemAccessLogger`, providing the length of the array, a custom description / name for the visualization, and a reference to a `CudaMemAccessStorage` instance. This class will then intercept all accesses to the original data using a proxy, which forwards the memory access information to the `CudaMemAccessStorage` instance.

```cpp
// Wrap the (device) data arrays in the CudaMemAccessLogger class
// Dereference the smart pointer first to pass the storage object itself
auto data = CudaMemAccessLogger<int>(d_data, 100, "Input data", *memAccessStorage);
auto result = CudaMemAccessLogger<int>(d_result, 100, "Result data", *memAccessStorage);
// Once again, you can also use a normal pointer here, but then you need to make sure to free the memory manually
// As the CudaMemAccessLogger class does not allocate any memory, you do not need to use a smart pointer here
```

#### Changing the kernel

Next we need to change the kernel slightly to take in the `CudaMemAccessLogger` instances instead of the original data arrays.

```cpp
// This is all you need to change
__global__ void kernel(CudaMemAccessLogger<int> data, CudaMemAccessLogger<int> result) {
```

Additionally, we now need to change the call to the kernel to pass in the wrapped data arrays instead of the original ones.

```cpp
kernel<<<10, 10>>>(data, result);
```

#### Getting the data

Now the code continues working as expected, but you also want to get the recorded access data.

For this you can use the `generateJSONOutput` function, which is what you want in most cases, as it produces a single JSON file which you can then drag and drop into the web-based visualization.

```cpp
// Get the data from the storage
// Make sure the kernel has finished executing before calling this function
memAccessStorage->generateJSONOutput("./my_data_output.json");
```

If you instead want to use the HTML template to directly embed the data in the HTML file _(warning: loading is then much slower)_, the code would look like this:

```cpp
memAccessStorage->generateTemplatedOutput("./path_to_template_file.html", "./path_to_output_file.html", CudaMemAccessStorage<int>::parseDataForJSPage);
```

#### Full example

Click below to open the spoiler and see the full example code.

<details>
<summary>Full example code</summary>

```cpp
#include <memory>
#include "cuda_mav.cuh"

__global__ void kernel(CudaMemAccessLogger<int> data, CudaMemAccessLogger<int> result) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    result[idx] = data[idx] * 2;
}

int main() {
    // Data on the host
    int* h_data = new int[100];
    int* h_result = new int[100];

    // Data on the device
    int* d_data;
    int* d_result;

    // Allocate memory on the device
    cudaMalloc(&d_data, 100 * sizeof(int));
    cudaMalloc(&d_result, 100 * sizeof(int));

    // Copy data to the device
    cudaMemcpy(d_data, h_data, 100 * sizeof(int), cudaMemcpyHostToDevice);

    auto memAccessStorage = std::make_unique<CudaMemAccessStorage<int>>(100*2);
    auto data = CudaMemAccessLogger<int>(d_data, 100, "Input data", *memAccessStorage);
    auto result = CudaMemAccessLogger<int>(d_result, 100, "Result data", *memAccessStorage);

    // Execute the kernel
    kernel<<<10, 10>>>(data, result);

    // Copy data back to the host
    cudaMemcpy(h_result, d_result, 100 * sizeof(int), cudaMemcpyDeviceToHost);

    // Store the memory access data
    memAccessStorage->generateJSONOutput("./my_data_output.json");

    // Free the device memory
    cudaFree(d_data);
    cudaFree(d_result);

    // Free the host memory
    delete[] h_data;
    delete[] h_result;
}
```

As you can see, only four lines of code (plus the two includes) were added, and two lines were changed. The remaining code is the same as before.

</details>


<p align="right">(<a href="#readme-top">back to top</a>)</p>

### Gotchas

There are some things you need to watch out for when using this library.
The two main ones have to do with how the submitted array is wrapped.

#### Passing a `CudaMemAccessLogger` pointer to a kernel

If you pass in a pointer to a `CudaMemAccessLogger` instance instead of the instance itself, you need to dereference the pointer before using it, as otherwise the array operator is not called and the wrong memory is accessed.

The previously shown example code would then look like this:

```cpp
__global__ void kernel(CudaMemAccessLogger<int>* data, CudaMemAccessLogger<int>* result) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    (*result)[idx] = (*data)[idx] * 2;
}
```
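
On the host side, the pointed-to logger objects must themselves be reachable from device code. The `examples/basic/normal.cu` file changed in this commit does this by allocating the loggers with `new`, relying on the library's overloaded `new` operator that returns a managed memory object; a minimal sketch of that pattern, adapted to the kernel above:

```cpp
// The library overloads operator new so the logger object itself lives in
// CUDA Unified Memory and can be dereferenced inside the kernel
// (see examples/basic/normal.cu in this commit)
auto* data = new CudaMemAccessLogger<int>(d_data, 100, "Input data", *memAccessStorage);
auto* result = new CudaMemAccessLogger<int>(d_result, 100, "Result data", *memAccessStorage);

kernel<<<10, 10>>>(data, result);
```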

#### Doing operations besides just assignment

The wrapper class only implements assigning a value of the templated type, or assigning another instance of the wrapper class.
This means you can only use the `=` operator. If you for example want to use the `++` operator, you have to change your kernel from:

```cpp
__global__ void kernel(CudaMemAccessLogger<int> data) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    data[idx]++;
}
```

to this:

```cpp
__global__ void kernel(CudaMemAccessLogger<int> data) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    data[idx] = data[idx] + 1;
}
```
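
By the same reasoning, compound assignments such as `+=` also bypass the plain `=` operator and would need the same rewrite (an inference from the restriction above, not a documented list):

```cpp
// data[idx] += 5;            // not supported by the wrapper
data[idx] = data[idx] + 5;    // equivalent form using only the = operator
```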

#### Synchronizing the device

The library does not synchronize the device after each kernel call. This means that if you want to get the data from the device, you need to synchronize the device manually. Do this either by explicitly calling `cudaDeviceSynchronize()` before using any of the `CudaMemAccessStorage` functions that output the data, or by placing the call to, for example, `generateJSONOutput` after a synchronous memory operation like `cudaMemcpy`.
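
A short sketch of both options, reusing the names from the example above:

```cpp
// Option 1: synchronize explicitly before writing the output
kernel<<<10, 10>>>(data, result);
cudaDeviceSynchronize();
memAccessStorage->generateJSONOutput("./my_data_output.json");

// Option 2: a synchronous copy such as cudaMemcpy only returns after the
// preceding kernel has finished, so the output call below it is also safe
kernel<<<10, 10>>>(data, result);
cudaMemcpy(h_result, d_result, 100 * sizeof(int), cudaMemcpyDeviceToHost);
memAccessStorage->generateJSONOutput("./my_data_output.json");
```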

## Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are **greatly appreciated**.

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement".
Don't forget to give the project a star! Thanks again!

1. Fork the Project
2. Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
3. Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the Branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request

<p align="right">(<a href="#readme-top">back to top</a>)</p>

## License

Distributed under the MPL-2.0 License. See `LICENSE.md` for more information.

<p align="right">(<a href="#readme-top">back to top</a>)</p>

[contributors-shield]: https://img.shields.io/github/contributors/NetroScript/GPU-memory-access-visualization.svg?style=for-the-badge
[contributors-url]: https://github.com/NetroScript/GPU-memory-access-visualization/graphs/contributors
[forks-shield]: https://img.shields.io/github/forks/NetroScript/GPU-memory-access-visualization.svg?style=for-the-badge
[forks-url]: https://github.com/NetroScript/GPU-memory-access-visualization/network/members
[stars-shield]: https://img.shields.io/github/stars/NetroScript/GPU-memory-access-visualization.svg?style=for-the-badge
[stars-url]: https://github.com/NetroScript/GPU-memory-access-visualization/stargazers
[issues-shield]: https://img.shields.io/github/issues/NetroScript/GPU-memory-access-visualization.svg?style=for-the-badge
[issues-url]: https://github.com/NetroScript/GPU-memory-access-visualization/issues
[license-shield]: https://img.shields.io/github/license/NetroScript/GPU-memory-access-visualization.svg?style=for-the-badge
[license-url]: https://github.com/NetroScript/GPU-memory-access-visualization/blob/master/LICENSE.md

examples/basic/normal.cu

Lines changed: 4 additions & 4 deletions

```diff
@@ -12,7 +12,7 @@ inline void checkCudaError(cudaError_t err) {
 }
 
 
-__global__ void kernel(int prob_size, CudaMemAccessLogger<int> * input, CudaMemAccessLogger<int> * output){
+__global__ void kernel(int prob_size, CudaMemAccessLogger<int>* input, CudaMemAccessLogger<int>* output){
     int id = threadIdx.x + blockIdx.x * blockDim.x;
     if (id < prob_size) {
         (*output)[id] = (*input)[id];
@@ -38,14 +38,14 @@ int main(){
     CudaMemAccessStorage<int>* memAccessStorage = new CudaMemAccessStorage<int>(10000);
 
     // The overloaded new operator generates a managed memory object
-    CudaMemAccessLogger<int>* input = new CudaMemAccessLogger<int>(d_input, prob_size, "Input Datastructure", memAccessStorage);
-    CudaMemAccessLogger<int>* output = new CudaMemAccessLogger<int>(d_output, prob_size, "Output Datastructure", memAccessStorage);
+    CudaMemAccessLogger<int>* input = new CudaMemAccessLogger<int>(d_input, prob_size, "Input Datastructure", *memAccessStorage);
+    CudaMemAccessLogger<int>* output = new CudaMemAccessLogger<int>(d_output, prob_size, "Output Datastructure", *memAccessStorage);
 
     constexpr int threads = 32;
     constexpr int blocks = (prob_size/threads)+1;
 
     //kernel<<<blocks, threads>>>(prob_size, input.getDevicePointer(), output.getDevicePointer());
-    kernel << <blocks, threads >> > (prob_size, input, output);
+    kernel <<<blocks, threads >>> (prob_size, input, output);
     checkCudaError(cudaGetLastError());
     cudaDeviceSynchronize();
 
```
examples/basic/random2dMatrix.cu

Lines changed: 6 additions & 8 deletions

```diff
@@ -23,17 +23,17 @@ inline void checkCudaErrorFunc(cudaError_t err, const char *file, int line)
     }
 }
 
-__global__ void decrement(unsigned int const size, CudaMemAccessLogger<unsigned int> *data, CudaMemAccessLogger<unsigned int> *control)
+__global__ void decrement(unsigned int const size, CudaMemAccessLogger<unsigned int> data, CudaMemAccessLogger<unsigned int> control)
 {
     int index = threadIdx.x + blockIdx.x * blockDim.x;
     int stride = blockDim.x * gridDim.x;
 
     for (int i = index; i < size; i += stride)
     {
-        while ((*data)[i] > 0)
+        while (data[i] > 0)
         {
-            (*data)[i] = (*data)[i] - 1;
-            (*control)[i] = (*control)[i] + 1;
+            data[i] = data[i] - 1;
+            control[i] = control[i] + 1;
         }
     }
 }
@@ -125,8 +125,8 @@ int main(int argc, char **argv)
     auto* memAccessStorage = new CudaMemAccessStorage<unsigned int>(dim * dim * 50);
 
     // Wrap the data classes with the custom logging class
-    auto* data = new CudaMemAccessLogger<unsigned int>(d_data, dim*dim, "Decremented Data", memAccessStorage);
-    auto* control = new CudaMemAccessLogger<unsigned int>(d_control, dim*dim, "Control Data", memAccessStorage);
+    auto data = CudaMemAccessLogger<unsigned int>(d_data, dim*dim, "Decremented Data", *memAccessStorage);
+    auto control = CudaMemAccessLogger<unsigned int>(d_control, dim*dim, "Control Data", *memAccessStorage);
 
     // change me and look, how the visulization looks like
     int const blockSize = 32;
@@ -182,8 +182,6 @@ int main(int argc, char **argv)
 
     // Free the memory of the custom logging classes
     delete memAccessStorage;
-    delete data;
-    delete control;
 
     return 0;
 }
```
