Project by lucinder and enbaik.
- I - Introduction
- II - Perlin Noise Algorithm
- III - Hypotheses and System Limitations
- IV - Experimental Setup
- V - Data
- VI - Analysis & Conclusion
- References
Perlin noise is a procedural noise texture generation algorithm designed by Ken Perlin, originally developed as part of the Pixel Stream Editor in 1985 [1][2]. In 2001, Perlin developed Simplex Noise, which improves on the original by using a simpler space-filling grid [3]. Traditionally, the algorithm is applied in terrain and particle generation programs, particularly in game development and landscape ecology [4][5][6], though it has extensive applications in other fields such as microbiology and material sciences [7][8][9][10]. Figures 1 and 2 show examples of the Perlin noise algorithm applied to terrain generation, with Figure 2 using a variant known as value noise.
Figure 1: Procedural Terrain Generation Using Perlin Noise
Figure 2: Procedural Terrain Generation Using Value Noise [11]
With the rising popularity of machine learning and deep learning algorithms for various applications, noise generation algorithms are essential in adversarial machine learning [12][13][14], with potential applications in defending against model poisoning attacks [15]. In areas of machine learning, Perlin noise has already seen some applications, particularly in data augmentation [16][17][18]. However, an overlooked factor in scaling this algorithm to modern machine learning systems is the dimensionality of the noise in relation to the data, which in many ML systems may be much higher than 2D or 3D.
Few studies have investigated the parallelization of Perlin noise generation in CUDA [19][20], and none of them have scaled the Perlin noise algorithm to $n$ dimensions in their implementations. As an unexplored avenue of research, creating a parallelized $n$D implementation of Perlin noise could lay a foundation for improved noise generation applications in data augmentation and adversarial machine learning, where data dimensionality may be much higher than the current three-dimensional limit.
The Perlin Noise algorithm can be divided into these major sections for each point in the matrix:
- Generate distance vectors from each point to its grid space’s edges.
- Generate the appropriate pseudo-random gradient vectors for each edge.
- Calculate the dot product of each distance vector with each gradient vector.
- Repeatedly perform interpolation in each dimension between dot products until only one value remains.
In subsections here, we explain the theoretical implementation, pseudocode, and
For a given 2-dimensional
Our pseudocode for distance vector generation is as follows:
distances = empty list
for i in [0, 2^n): // loop through edges
    for j in [0, n): // loop through dimensions
        edge = (i >> j) & 1 // get the jth binary digit of i
        distances[i*n+j] = point[j] - edge // take the distance
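The distance step above can be sketched on the host side in C++. This is a minimal illustration rather than the project's CUDA code: the function name `corner_distances` is hypothetical, and it assumes `point` holds the fractional position of the point inside its unit grid cell, with the jth binary digit of the edge index `i` selecting the 0 or 1 coordinate along dimension `j`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Offsets from `point` to every edge (corner) of its unit grid cell.
// Assumption: each component of `point` is the fractional position in [0, 1].
// The result is row-major: distances[i * n + j] for edge i, dimension j.
std::vector<float> corner_distances(const std::vector<float>& point) {
    const std::size_t n = point.size();
    assert(n > 0 && n < 31);  // keep 2^n comfortably representable
    const std::size_t corners = std::size_t{1} << n;
    std::vector<float> distances(corners * n);
    for (std::size_t i = 0; i < corners; ++i) {      // loop through edges
        for (std::size_t j = 0; j < n; ++j) {        // loop through dimensions
            const float edge = static_cast<float>((i >> j) & 1u);  // jth bit of i
            distances[i * n + j] = point[j] - edge;
        }
    }
    return distances;
}
```

For a 2-dimensional point this yields four rows of two components each, one row per edge of the grid cell.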
Expanding gradient vector generation to
The gradient vectors generally consist of pseudo-random components determined by a hash function. We encountered difficulty understanding and implementing the gradient function in a fashion consistent with other writings, as many implementations offer different gradient generation methods. However, our final implementation uses the hash function’s pseudo-randomness to map block and thread offsets to random seeding. It uses the hashed seed to randomly generate a set of gradient values from -1 to 1. Once generated, we divide these values by the Euclidean length of the resulting vector to normalize the vector to a length of 1, placing it on the $n$-dimensional unit sphere.
Our pseudocode for gradient vector generation is as follows:
gradients = empty list
srand with hash[(blockDim.x * blockIdx.x + threadIdx.x) % len(hash)]
for i in [0, 2^n): // loop through edges
    euclidean_distance = 0
    for j in [0, n): // loop through dimensions
        gradients[i*n+j] = rand in (-1, 1)
        euclidean_distance += gradients[i*n+j]^2
    euclidean_distance = sqrt(euclidean_distance)
    for j in [0, n):
        gradients[i*n+j] = gradients[i*n+j] / euclidean_distance
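The gradient step can be sketched in host-side C++ as follows. This is only an illustration under stated assumptions: `std::mt19937` stands in for the `srand`/`rand` calls in the pseudocode, the `seed` parameter is a hypothetical stand-in for the hashed block and thread offset, and the name `unit_gradients` is not from the project code.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// One pseudo-random unit gradient per edge (corner) of an n-dimensional cell.
// Assumption: `seed` replaces hash[(blockDim.x * blockIdx.x + threadIdx.x) % len(hash)].
std::vector<float> unit_gradients(std::size_t n, unsigned seed) {
    assert(n > 0 && n < 31);
    const std::size_t corners = std::size_t{1} << n;
    std::mt19937 rng(seed);
    std::uniform_real_distribution<float> dist(-1.0f, 1.0f);
    std::vector<float> gradients(corners * n);
    for (std::size_t i = 0; i < corners; ++i) {
        float length = 0.0f;
        do {  // re-draw in the very unlikely case of an all-zero vector
            length = 0.0f;
            for (std::size_t j = 0; j < n; ++j) {
                const float g = dist(rng);
                gradients[i * n + j] = g;
                length += g * g;
            }
            length = std::sqrt(length);
        } while (length == 0.0f);
        for (std::size_t j = 0; j < n; ++j) {
            gradients[i * n + j] /= length;  // normalize to length 1
        }
    }
    return gradients;
}
```

Because the generator is seeded explicitly, calling the function twice with the same seed reproduces the same gradients, which mirrors the role the hash plays in the pseudocode.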
In this step, we take the dot products of the gradient and distance vectors of each edge to acquire an “influence” vector (denoted as products) for the whole matrix point, showing the influences of each edge on the point with pseudo-randomization from the gradients. This is a fast step, taking
Or,
Our pseudocode for the dot product of the distance and gradient vectors is as follows:
products = empty list
for i in [0, 2^n):
    dot = 0
    for j in [0, n):
        dot += distances[i*n+j] * gradients[i*n+j]
    products[i] = dot
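As a small host-side C++ sketch of this step, the function below takes the flattened distance and gradient vectors and returns one dot product per edge. The name `corner_products` and the row-major layout `[i * n + j]` are assumptions carried over from the pseudocode, not the project's actual kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Dot product of the distance and gradient vectors of every edge (corner).
// Both inputs are row-major with n components per edge.
std::vector<float> corner_products(const std::vector<float>& distances,
                                   const std::vector<float>& gradients,
                                   std::size_t n) {
    assert(n > 0 && distances.size() == gradients.size());
    assert(distances.size() % n == 0);
    const std::size_t corners = distances.size() / n;
    std::vector<float> products(corners, 0.0f);
    for (std::size_t i = 0; i < corners; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            products[i] += distances[i * n + j] * gradients[i * n + j];
        }
    }
    return products;
}
```

Each of the $2^n$ edges costs $n$ multiply-adds, so the whole step performs $n \cdot 2^n$ of them per point.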
Two different types of interpolation are commonly used to generate Perlin noise: linear interpolation and cosine interpolation. Cosine interpolation is typically preferred as it provides a smoother fade of different values across the grid. However, we will use linear interpolation for our implementation, as we are primarily interested in performance. Our implementation can be easily modified to use cosine interpolation instead of linear interpolation.
Given that we do not have a set dimensionality, our literal number of iterations for interpolation is unknown but can be modeled as
Our pseudocode for the interpolation is as follows:
step = 1, dimension = 0
while step < 2^n:
    for i in [0, 2^n) incrementing by 2*step:
        fraction = point[dimension]
        products[i] = products[i] + fraction * (products[i+step] - products[i])
    step *= 2
    dimension++ // move to the next dimension
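The collapsing interpolation can be sketched in host-side C++ as below; the name `collapse_lerp` is hypothetical. Each pass halves the number of live values, so a cell with $2^n$ edges needs $2^{n-1} + 2^{n-2} + \dots + 1 = 2^n - 1$ interpolations in total.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Collapse 2^n per-edge values into one by linear interpolation, one
// dimension per pass. `products` is taken by value so the caller's copy
// is left untouched; `point` holds the fractional position per dimension.
float collapse_lerp(std::vector<float> products, const std::vector<float>& point) {
    const std::size_t n = point.size();
    assert(n > 0 && n < 31);
    const std::size_t corners = std::size_t{1} << n;
    assert(products.size() == corners);
    std::size_t dimension = 0;
    for (std::size_t step = 1; step < corners; step *= 2, ++dimension) {
        const float fraction = point[dimension];
        for (std::size_t i = 0; i < corners; i += 2 * step) {
            // products[i] and products[i + step] differ only in bit `dimension`
            products[i] += fraction * (products[i + step] - products[i]);
        }
    }
    return products[0];
}
```

As a check, with per-edge values 0, 1, 2, 3 in two dimensions (that is, $x + 2y$ at the cell's edges) and the point (0.25, 0.5), the result is the bilinear value 1.25.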
The fade function defined by Perlin,
step = 1, dimension = 0
while step < 2^n:
    for i in [0, 2^n) incrementing by 2*step:
        f = point[dimension] // our fractional displacement
        f = f*f*f*(f*(f*6-15)+10) // apply fade function
        products[i] = products[i] + f * (products[i+step] - products[i])
    step *= 2
    dimension++ // move to the next dimension
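The fade expression in the pseudocode above is the polynomial $6t^5 - 15t^4 + 10t^3$ written in nested form. A minimal C++ sketch, with `fade` as an illustrative name, shows its behavior at a few points; its first and second derivatives are both zero at $t = 0$ and $t = 1$, which is what smooths the transition between neighboring grid cells.

```cpp
#include <cassert>

// Quintic fade curve 6t^5 - 15t^4 + 10t^3 in nested (Horner-style) form.
// Assumption: t is the fractional displacement within a cell, in [0, 1].
float fade(float t) {
    assert(t >= 0.0f && t <= 1.0f);
    return t * t * t * (t * (t * 6.0f - 15.0f) + 10.0f);
}
```

The curve fixes the endpoints and the midpoint, so `fade(0.0f)`, `fade(0.5f)`, and `fade(1.0f)` return 0, 0.5, and 1, while values below the midpoint are pulled toward 0 and values above it toward 1.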
Combining all parts of the pseudocode thus far, our pseudocode for calculating the Perlin noise value for a given point is as follows:
perlin(point):
    // generate gradients and distances
    distances = empty list
    gradients = empty list
    srand with hash[(blockDim.x * blockIdx.x + threadIdx.x) % 256]
    for i in [0, 2^n): // loop through edges
        euclidean_distance = 0
        for j in [0, n): // loop through dimensions
            index = i*n+j
            gradients[index] = rand in (-1, 1)
            euclidean_distance += gradients[index]^2
            edge = (i >> j) & 1 // get the jth binary digit of i
            distances[index] = point[j] - edge // take the distance
        euclidean_distance = sqrt(euclidean_distance)
        for j in [0, n):
            gradients[i*n+j] = gradients[i*n+j] / euclidean_distance
    // calculate dot product
    products = empty list
    for i in [0, 2^n):
        dot = 0
        for j in [0, n):
            dot += distances[i*n+j] * gradients[i*n+j] // take dot product
        products[i] = dot
    // interpolation
    step = 1, dimension = 0
    while step < 2^n:
        for i in [0, 2^n) incrementing by 2*step:
            f = point[dimension] // our fractional displacement
            f = f*f*f*(f*(f*6-15)+10) // apply fade function
            products[i] = products[i] + f * (products[i+step] - products[i])
        step *= 2
        dimension++ // move to the next dimension
    return products[0] // return the collapsed dot product
As is, this is a very inefficient implementation of the noise function in terms of memory, with each point requiring two vectors of length $n \cdot 2^n$ (distances and gradients) in addition to the products vector. Instead, a single gradients vector can be used to hold the dot products while a set-length vector loads the current gradient for each edge. In the linear interpolation step, we can collapse each interpolation along the steps in the gradients vector, holding the dot products until the final interpolation rests at gradients[0]. Additionally, we can forgo creating a distances vector, as distances are only used once per edge; instead, we can use their formula implicitly while calculating dot products. The following shows an improved version of the pseudocode using these trivial memory optimizations.
perlin(point):
    gradients = empty list
    srand with hash[(blockDim.x * blockIdx.x + threadIdx.x) % 256]
    // calculate gradients
    for i in [0, 2^n): // loop through edges
        current_gradient = empty list
        euclidean_distance = 0
        for j in [0, n):
            current_gradient[j] = rand in (-1, 1)
            euclidean_distance += current_gradient[j]^2
        euclidean_distance = sqrt(euclidean_distance)
        // calculate distances and dot products
        edge = i & 1 // get the 0th binary digit of i
        gradients[i] = (current_gradient[0] / euclidean_distance) * (point[0] - edge) // unroll first addition operation into initialization
        for j in [1, n):
            edge = (i >> j) & 1 // get the jth binary digit of i
            gradients[i] += (point[j] - edge) * (current_gradient[j] / euclidean_distance)
    // linear interpolation
    step = 1, dimension = 0
    while step < 2^n:
        for i in [0, 2^n) incrementing by 2*step:
            f = point[dimension] // our fractional displacement
            f = f*f*f*(f*(f*6-15)+10) // apply fade function
            gradients[i] = gradients[i] + f * (gradients[i+step] - gradients[i])
        step *= 2
        dimension++ // move to the next dimension
    return gradients[0]
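For reference, the memory-optimized pseudocode above can be sketched as a sequential C++ function. This is an illustration under stated assumptions, not the project's CUDA kernel: `std::mt19937` replaces `srand`/`rand`, the `seed` parameter is a hypothetical stand-in for the hashed thread index, and `perlin_point` is an illustrative name. Only a $2^n$-length vector and one $n$-length gradient are kept per point.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Quintic fade curve applied to the fractional displacement.
inline float fade(float t) { return t * t * t * (t * (t * 6.0f - 15.0f) + 10.0f); }

// Noise value for one point, given its fractional position inside a unit cell.
// Assumption: `seed` stands in for hash[(blockDim.x * blockIdx.x + threadIdx.x) % 256].
float perlin_point(const std::vector<float>& point, unsigned seed) {
    const std::size_t n = point.size();
    assert(n > 0 && n < 31);
    const std::size_t corners = std::size_t{1} << n;
    std::mt19937 rng(seed);
    std::uniform_real_distribution<float> dist(-1.0f, 1.0f);

    std::vector<float> products(corners);  // holds dot products, then collapses
    std::vector<float> gradient(n);        // reused for every edge
    for (std::size_t i = 0; i < corners; ++i) {
        float length = 0.0f;
        for (std::size_t j = 0; j < n; ++j) {
            gradient[j] = dist(rng);
            length += gradient[j] * gradient[j];
        }
        length = std::sqrt(length);
        if (length == 0.0f) length = 1.0f;  // all-zero draw: contributes nothing
        float dot = 0.0f;
        for (std::size_t j = 0; j < n; ++j) {
            const float edge = static_cast<float>((i >> j) & 1u);
            dot += (point[j] - edge) * (gradient[j] / length);  // implicit distance
        }
        products[i] = dot;
    }
    std::size_t dimension = 0;
    for (std::size_t step = 1; step < corners; step *= 2, ++dimension) {
        const float f = fade(point[dimension]);
        for (std::size_t i = 0; i < corners; i += 2 * step) {
            products[i] += f * (products[i + step] - products[i]);
        }
    }
    return products[0];
}
```

A point that sits exactly on the first edge of its cell (all components zero) evaluates to 0, since its distance vector to that edge is zero and the fade weights select only that edge.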
For measuring runtimes, we ran our code with the following system specifications:
- CPU: AMD Ryzen 7 4800H
- GPU: NVIDIA GeForce RTX 3060 Laptop
- Compute Capability [22]: 8.6
- Architecture [23]: Ampere
- 30 Streaming Multiprocessors
- Software: Visual Studio 2022 v. 17.9.3, CUDA v. 12.3
- CPU Memory: 16 GB RAM, 475 GB HDD
- GPU Memory: 6 GB, 192-bit Memory Interface Width
For running the profiler, we ran our code on a separate system with the following specifications:
- CPU: Intel® Core™ i9-10920X (12 cores, 3.5 GHz)
- GPU: NVIDIA GeForce RTX 3070
- Compute Capability [22]: 8.6
- Architecture: Ampere
- 46 Streaming Multiprocessors
- Software: Nsight Compute v. 2024.1, CUDA v. 12.3
- CPU Memory: 32 GB RAM, 1 TB HDD
- GPU Memory: 8 GB, 256-bit Memory Interface Width
With Compute Capability 8.6 on both systems, we also observe the following technical specifications [24]:
- Maximum 128 resident grids per device.
- Maximum 16 resident blocks, 1536 resident threads, and 48 resident warps per SM.
- Maximum 64K registers per thread block, 255 registers per thread.
- For 1-D grids/blocks:
  - $2^{31}-1$ maximum blocks per grid.
  - 1024 maximum threads per block.
- Warp size 32.
With this in mind, our system should be able to host over
One of the primary limitations on the feasibility of parallelizing Perlin noise is the space
complexity needed to hold all relevant data in thread registers. While the baseline matrix of size
m and dimensionality n will only generate
Table 1: Matrix Size (m) & Dimensionality (n) vs. Total Memory Space and Registers per Thread Needed
| Size $m$ / Dim. $n$ | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| 10 | 40 B (10 floats) | 800 B (200 floats) | 12 KB (3000 floats) | 160 KB (40,000 floats) | 2 MB (500,000 floats) |
| 100 | 400 B (100 floats) | 80 KB (20,000 floats) | 12 MB | 1.6 GB | 200 GB |
| 500 | 2 KB (500 floats) | 320 KB (80,000 floats) | 1.5 GB | 8 GB | 625 TB |
| 1000 | 4 KB (1000 floats) | 8 MB | 12 GB | 16 TB | 20 PB |
| R/T | 4 | 8 | 14 | 20 | 42 |
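The rows of Table 1 for $m = 10$, $m = 100$, and $m = 1000$ are consistent with a footprint of $n \cdot m^n$ single-precision values at 4 bytes each. The C++ sketch below reproduces that arithmetic; the formula is inferred from those table entries rather than stated in the text, and `noise_bytes` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Bytes for n * m^n 4-byte floats (an assumed reading of Table 1).
// Returns 0 if the byte count would not fit in 64 bits.
std::uint64_t noise_bytes(std::uint64_t m, unsigned n) {
    assert(m > 0 && n > 0);
    const std::uint64_t limit = std::numeric_limits<std::uint64_t>::max();
    const std::uint64_t bytes_per_float = 4;
    std::uint64_t total = n;
    for (unsigned k = 0; k < n; ++k) {
        if (total > limit / m) return 0;  // next multiplication would overflow
        total *= m;
    }
    if (total > limit / bytes_per_float) return 0;
    return total * bytes_per_float;
}
```

For example, `noise_bytes(100, 3)` gives 12,000,000 bytes (12 MB) and `noise_bytes(1000, 5)` gives $2 \times 10^{16}$ bytes (20 PB), matching the corresponding table cells.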
With the given optimizations shown in pseudocode, in addition to parallelizing matrix
operations, we aim to produce simple yet effective code that operates on large matrix sizes in
small dimensionalities and medium matrix sizes in higher dimensionalities. In regard to runtime,
we hypothesize that our runtimes of the parallelized noise generation will be much faster than
sequential runtimes and fall close to or within
As this is a runtime- and memory-heavy algorithm at higher dimensionalities, we limited our testing to a matrix size of a maximum side length of 10,000 and a maximum dimensionality of 5. We aimed to test the side lengths 10, 250, 500, 1000, 5000, and 10,000, and the dimensionalities 1, 2, 3, 4, and 5.
The time complexity of the Perlin noise algorithm for a grid of $m^n$ points is pow(INT_MAX, 1.0/m) < n), as the math functions necessary to lay out the matrix size will cause overflow.
Table 2: Total Threads (T) and 512-Thread Blocks (B) Needed for Noise Generation at Different Matrix Sizes and Dimensionalities
| Size $m$ / Dim. $n$ | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| 10 | T: 10, B: 1 | T: 100, B: 1 | T: 1000, B: 2 | T: 10,000, B: 20 | T: 100,000, B: 196 |
| 100 | T: 100, B: 1 | T: 10,000, B: 20 | T: 1,000,000, B: 1954 | T: $10^8$, B: 195,312 | T: |
| 250 | T: 250, B: 1 | T: 62,500, B: 123 | T: 15,625,000, B: 30,518 | T: | T: |
| 500 | T: 500, B: 1 | T: 250,000, B: 489 | T: 125,000,000, B: 244,141 | T: | T: |
| 1000 | T: 1000, B: 2 | T: 1,000,000, B: 1954 | T: $10^9$, B: 1,953,125; matrix overflow due to byte count | T: | T: |
| 5000 | T: 5000, B: 10 | T: 25,000,000, B: 48,829 | T: | T: | T: |
| 10,000 | T: 10,000, B: 20 | T: $10^8$, B: 195,312 | T: | T: | T: |
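The thread and block counts in Table 2 follow from one thread per noise point and 512 threads per block. A minimal C++ sketch of that arithmetic is shown below, assuming the point count must fit in a 32-bit signed integer; the helper names `total_points` and `blocks_needed` are hypothetical, and the check covers only the element count, not the byte count of the matrix.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// m^n as a point (and thread) count, or -1 if it exceeds the 32-bit int range.
std::int64_t total_points(std::int64_t m, unsigned n) {
    assert(m > 0);
    const std::int64_t limit = std::numeric_limits<std::int32_t>::max();
    std::int64_t total = 1;
    for (unsigned k = 0; k < n; ++k) {
        if (total > limit / m) return -1;  // next multiplication would overflow
        total *= m;
    }
    return total;
}

// Blocks needed so that every point gets a thread (ceiling division).
std::int64_t blocks_needed(std::int64_t threads, std::int64_t threads_per_block = 512) {
    assert(threads >= 0 && threads_per_block > 0);
    return (threads + threads_per_block - 1) / threads_per_block;
}
```

For instance, $m = 10$, $n = 5$ gives 100,000 threads in 196 blocks, while $m = 10{,}000$, $n = 3$ is rejected because $10^{12}$ points do not fit in a 32-bit integer.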
To act as a proper generation of Perlin noise, our algorithm needs to satisfy three conditions:
- Noise values must be uniformly distributed within the range of -1.0 to 1.0.
- Noise values must have overall similarity to neighbors.
- Noise must appear random on a small scale, but follow similar patterns on larger scales.
To preserve our noise points and ensure our algorithm is valid, our CUDA code outputs
to a .txt file, perlin_out.txt. This file is formatted so that the first line shows the matrix's size
(
While our gradient vector generation derives from a C implementation of the algorithm,
different implementations of Perlin noise use different gradient generation algorithms [3][21],
and our use of the
For the feasibility of development time, our visual outputs of Perlin noise will be 2-dimensional only. We can represent higher-dimensional noise matrices in the 2-dimensional space as collapsed grids. Still, these grids will not accurately show the adjacency of noise points in higher dimensions and will exist solely to verify algorithmic integrity.
Figures 3, 4, 5, and 6 show the pixel plots of noise generations for grid sizes
Figures 3, 4, 5, & 6: Noise matrices for sizes $10^2$, $100^2$, $500^2$, and $1000^2$
Noise values do appear to be uniformly distributed in the range of -1.0 to 1.0. At
Figure 7: Example of ideal generated Perlin noise
Table 3 shows the actual kernel configuration of the CUDA kernels at different matrix sizes and dimensionalities. As previously mentioned, we encountered unexpected errors when running with more than 512 threads per block, despite our systems supporting a maximum of 1024 threads per block. The cause of these errors is unknown.
Table 3: Kernel Configurations at Different Matrix Sizes + Dimensionalities
Table 4 shows the runtimes in seconds of the noise generation algorithm to 3 significant digits, as implemented in a sequential rather than parallel fashion, not including the memory operations required for initializing and freeing the noise vector.
Table 4: Sequential runtime (secs.) of Perlin noise algorithm
Table 5 shows the runtimes in seconds of the parallelized noise generation algorithm without memory or file-write operations to 3 significant digits. Table 6 shows the runtimes of the same algorithm with the memory and file-write operations to 3 significant digits. We cannot proceed with performance tests on layouts that are infeasible in the scope of block and thread limitations or that produce errors due to memory space limitations; these are indicated in Tables 4 and 5.
Table 5: Parallel runtime (secs.) of Perlin noise algorithm (no memory/file-write operations)
Table 6: Parallel runtime (secs.) of Perlin noise algorithm (including memory/file-write operations)
In 4 dimensions, our code always produced runtime errors in copying memory between device and host code despite vector sizes not being particularly extreme. The noise generation worked in 5 dimensions within the bounds of viable matrix sizes, and as such, it is challenging to determine the cause of these errors. The errors do appear to be specific to the cudaMemcpy operation, apparently as an issue with the arguments provided, despite no such errors occurring with the same arguments in different dimensionalities.
Memory operations and file-writing incurred a very significant runtime overhead in our parallelized implementation. Figure 8 shows the difference in performance between the runtimes with memory operations not included versus included.
Figure 8: Total matrix size ($m^n$) vs runtime (secs.) of parallel noise generation with and without memory operations
Overall, the CUDA implementation is much faster than the sequential implementation at
large dimensionalities and matrix sizes. Figure 9 shows the runtimes of sequential and parallel
executions (without memory operations) at different total matrix sizes (
Figure 9: Sequential vs. CUDA runtimes (secs.) at different total matrix sizes ($m^n$).
Profiling was performed on a computer with an NVIDIA RTX 3070 GPU. Though we
were able to gather data for
Figures 10 & 11: Memory and DRAM Throughput at different matrix sizes
Figure 12: DRAM Frequency at different matrix sizes
Memory and DRAM throughput followed similar patterns, as DRAM throughput contributes to overall memory throughput. Meanwhile, DRAM frequency increases significantly at matrix size 400 for
Figure 13: Streaming Multiprocessor Frequency at different matrix sizes
Figure 14: Compute (SM) Throughput at different matrix sizes
In most scenarios, Streaming Multiprocessor (SM) Frequency and Compute (SM) Throughput should have a direct relationship, but as seen in Figures 13 and 14, these two metrics have an inverse relationship: each trends in the opposite direction as the other increases or decreases. Each kernel's task and computation algorithm remain the same regardless of matrix, grid, or block size. This information helps narrow down possible causes for this issue, since block and grid size were calculated based on matrix size, each increasing incrementally by 100 from matrix sizes 100 to 500 for
Figures 15 & 16: Achieved occupancy and warps at different matrix sizes
Figures 17 & 18: L1 and L2 Cache Throughput with Increasing Matrix Size
In terms of occupancy and cache throughput shown in Figures 15, 16, 17, and 18, we can
see a general increasing trend between these metrics for
Figure 19: Runtime comparison between n = 2 and n = 3, with table
Lastly, for runtime in the profiler, we notice that duration for $n = 3$ follows the typical growth pattern as the size of the matrix increases, but for $n = 2$, we can see a small spike at $m = 400$. This pattern concurs with the previous profiling results, where $m = 400$ induces a spike in the pattern. Due to the scaling of the graph, the jump in runtime may not be evident. Still, from the table to the right of Figure 19, we can see that the runtime increased by over 400% from $m = 300$, whereas previous increments only increased duration by a little over 300%. This increase also falls off for increments of matrix size beyond $m = 400$.
With the limitations encountered in this study, it appears infeasible to generate
Our runtimes show poorer performance of the parallelized algorithm than sequential at
low
Though results from the profiler were largely consistent with the expected patterns for increasing matrix size (and therefore overall thread count), there are two notable anomalies. The first is the spikes occurring at
The second anomaly is the inverse relationship between SM Frequency and Compute
Throughput. Under normal circumstances, these two variables should share a direct relationship.
Our standing theory as to why we observe this inverse relationship is that it ties into the same
issue behind the 512 thread limit and
While our current priority is scoping out the source of the cudaMemcpy errors for
One easy optimization that would improve both per-thread memory usage and smoothing of the noise matrix would be to globalize the gradient vector. Each thread would still load one gradient to global memory, but by using global memory, we can reduce the registers used per thread and prevent conflicts over resources that might result in reduced occupancy.
Another optimization could be a tiled approach, with threads handling multiple points and utilizing shared memory to load the current gradients. Grid points could also be moved to shared memory, but they contribute much less towards the overall memory usage than the gradient vectors. Using shared memory would be a tradeoff between the number of potential cache loads and the register usage, but with our memory constraints, it may be a worthwhile tradeoff.
One final optimization, perhaps the most prudent, would be loading a stencil of 256 gradients to global memory, which can be reused at given increments of points. Because gradients are determined by a repeating hash function and, therefore, repeat every 256 values, there is little point in having a full vector of
[1] K. Perlin. “An image synthesizer.” ACM SIGGRAPH Computer Graphics, vol. 19, no. 3, 1 Jul. 1985, pp. 287-296, doi: 10.1145/325165.325247.
[2] K. Perlin. “Chapter 4: In the beginning: The Pixel Stream Editor.” In SIGGRAPH 2002 Course 36 Notes, 2001.
[3] S. Gustavson. “Simplex noise demystified.” [Online]. Available: https://web.archive.org/web/20230310204125/https://webstaff.itn.liu.se/~stegu/simplexnoise/simplexnoise.pdf.
[4] T. R. Etherington. “Perlin noise as a hierarchical neutral landscape model.” Web Ecology, vol. 22, no. 1, 2022, pp. 1-6, doi: https://doi.org/10.5194/we-22-1-2022.
[5] F. Gürler and E. Onbaşioğlu, "Applying Perlin Noise on 3D Hexagonal Tiled Maps," 2022 International Symposium on Multidisciplinary Studies and Innovative Technologies (ISMSIT), Ankara, Turkey, 2022, pp. 670-673, doi: 10.1109/ISMSIT56059.2022.9932712.
[6] S. Ahmed and B. Pandey. “Procedural Terrain Generation by Sampling a 2D Monochrom Perlin Noise Map in Unity.” Asian Journal of Research in Computer Science, vol. 16, no. 1, 2023, pp. 37-42, doi: 10.9734/ajrcos/2023/v16i1333.
[7] D. Jakes, K. Burrage, C. C. Drovandi, P. Burrage, A. Bueno-Orovio, R. W. dos Santos, B. Rodriguez, and B. A. J. Lawson. “Perlin Noise Generation of Physiologically Realistic Patterns of Fibrosis.” bioRxiv Preprint, Jun. 2019, doi: https://doi.org/10.1101/668848.
[8] A. Alreni, G. Momcheva, and S. Pavlov. “Voronoi Diagrams and Perlin Noise for Simulation of Irregular Artefacts in Microscope Scans.” In Proceedings of the 15th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2022) - BIOIMAGING, SciTePress, 2022, pp. 117-122, doi: 10.5220/0010833000003123.
[9] S. Michot-Roberto, A. Garcia-Hernández, S. Dopazo-Hilario, and A. Dawson. “The spherical primitive and perlin noise method to recreate realistic aggregate shapes.” Granular Matter, vol. 23, no. 41, 2021, doi: https://doi.org/10.1007/s10035-021-01105-6.
[10] F. Conde-Rodríguez, Á. L. García-Fernández, and J. C. Torres. “Modelling Material Microstructure Using the Perlin Noise Function.” Computer Graphics Forum, vol. 40, no. 1, 2021, pp. 195-208, doi: https://doi.org/10.1111/cgf.14182.
[11] I. Parberry. “Designer worlds: Procedural generation of infinite terrain from real-world elevation data.” Journal of Computer Graphics Techniques, vol. 3, no. 1, 2014.
[12] T. Kaneko and T. Harada. “Noise Robust Generative Adversarial Networks.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2020, pp. 8404-8414.
[13] A. Kurakin, I. Goodfellow, and S. Bengio. “Adversarial Machine Learning at Scale.” arXiv preprint, 2016, doi: https://doi.org/10.48550/arXiv.1611.01236.
[14] A. S. Hashemi and S. Mozaffari. “Secure deep neural networks using adversarial image generation and training with Noise-GAN.” Computers & Security, vol. 86, 2019, pp. 372-387, doi: https://doi.org/10.1016/j.cose.2019.06.012.
[15] T. Y. Liu, Y. Yang, and B. Mirzasoleiman. “Friendly Noise against Adversarial Noise: A Powerful Defense against Data Poisoning Attack.” In Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022, pp. 11947-11959.
[16] N. Inoue, E. Yamagata and H. Kataoka, "Initialization Using Perlin Noise for Training Networks with a Limited Amount of Data," 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 2021, pp. 1023-1028, doi: 10.1109/ICPR48806.2021.9412955.
[17] W. Bazuhair and W. Lee, "Detecting Malign Encrypted Network Traffic Using Perlin Noise and Convolutional Neural Network," 2020 10th Annual Computing and Communication Workshop and Conference (CCWC), Las Vegas, NV, USA, 2020, pp. 0200-0206, doi: 10.1109/CCWC47524.2020.9031116.
[18] H. Bae, C. Kim, N. Kim, B. Park, N. Kim, J. B. Seo, & S. M. Lee. “A Perlin Noise-Based Augmentation Strategy for Deep Learning with Small Data Samples of HRCT Images.” Sci Rep vol. 8, no. 17687, 2018, doi: https://doi.org/10.1038/s41598-018-36047-2.
[19] H. Li, X. Tuo, Y. Liu, and X. Jiang. “A Parallel Algorithm Using Perlin Noise Superposition Method for Terrain Generation Based on CUDA architecture.” In Proceedings of the 2015 International Conference on Materials Engineering and Information Technology Applications, Aug. 2015, pp. 967-974, doi: 10.2991/meita-15.2015.183.
[20] E. Skejić, D. Demirović, and D. Begić. “Evaluation of Perlin Noise using NVIDIA CUDA Platform.” Elektrotehniski Vestnik, vol. 87, no. 5, 2020, pp. 260-266.
[21] K. Perlin. “Chapter 2: Noise Hardware.” In SIGGRAPH 2002 Course 36 Notes, 2002.
[22] “Your GPU Compute Capability.” NVIDIA Developer. [Online]. Available: https://developer.nvidia.com/cuda-gpus.
[23] “NVIDIA GeForce RTX 3060 Mobile.” TechPowerUp. [Online]. Available: https://www.techpowerup.com/gpu-specs/geforce-rtx-3060-mobile.c3757.
[24] “CUDA C++ Programming Guide.” CUDA, 2 Mar. 2024. [Online]. Available: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications-technical-specifications-per-compute-capability.