Conversation
Will have to wait for #2593 to get merged.
CUDA.jl Benchmarks
| Benchmark suite | Current: c23ab7f | Previous: 14ae82d | Ratio |
|---|---|---|---|
| latency/precompile | 45424277593.5 ns | 45345622059 ns | 1.00 |
| latency/ttfp | 6384836106.5 ns | 6434638936 ns | 0.99 |
| latency/import | 3034614553 ns | 3051828695.5 ns | 0.99 |
| integration/volumerhs | 9567744 ns | 9568259 ns | 1.00 |
| integration/byval/slices=1 | 146777 ns | 146590 ns | 1.00 |
| integration/byval/slices=3 | 425370 ns | 425823 ns | 1.00 |
| integration/byval/reference | 144803 ns | 144766 ns | 1.00 |
| integration/byval/slices=2 | 286177 ns | 286423 ns | 1.00 |
| integration/cudadevrt | 103434 ns | 103488.5 ns | 1.00 |
| kernel/indexing | 13938 ns | 14282.5 ns | 0.98 |
| kernel/indexing_checked | 15060 ns | 15333 ns | 0.98 |
| kernel/occupancy | 701.1063829787234 ns | 720.4492753623189 ns | 0.97 |
| kernel/launch | 2124.1111111111113 ns | 2130.5 ns | 1.00 |
| kernel/rand | 16334 ns | 17397 ns | 0.94 |
| array/reverse/1d | 19520 ns | 19471 ns | 1.00 |
| array/reverse/2d | 24603 ns | 24536 ns | 1.00 |
| array/reverse/1d_inplace | 10031.666666666666 ns | 10836.333333333334 ns | 0.93 |
| array/reverse/2d_inplace | 11528 ns | 11284 ns | 1.02 |
| array/copy | 20270 ns | 20310 ns | 1.00 |
| array/iteration/findall/int | 159097 ns | 158042 ns | 1.01 |
| array/iteration/findall/bool | 139369 ns | 138224 ns | 1.01 |
| array/iteration/findfirst/int | 153853 ns | 154038.5 ns | 1.00 |
| array/iteration/findfirst/bool | 154627.5 ns | 155126 ns | 1.00 |
| array/iteration/scalar | 75657 ns | 76714 ns | 0.99 |
| array/iteration/logical | 207799 ns | 214056.5 ns | 0.97 |
| array/iteration/findmin/1d | 41128 ns | 41628 ns | 0.99 |
| array/iteration/findmin/2d | 94766 ns | 94463 ns | 1.00 |
| array/reductions/reduce/1d | 38659 ns | 51305 ns | 0.75 |
| array/reductions/reduce/2d | 44155.5 ns | 42302 ns | 1.04 |
| array/reductions/mapreduce/1d | 37246.5 ns | 44898.5 ns | 0.83 |
| array/reductions/mapreduce/2d | 51913.5 ns | 52966.5 ns | 0.98 |
| array/broadcast | 21698 ns | 21607 ns | 1.00 |
| array/copyto!/gpu_to_gpu | 11663 ns | 13399 ns | 0.87 |
| array/copyto!/cpu_to_gpu | 213248 ns | 213579.5 ns | 1.00 |
| array/copyto!/gpu_to_cpu | 246883 ns | 245985.5 ns | 1.00 |
| array/accumulate/1d | 108538 ns | 109003 ns | 1.00 |
| array/accumulate/2d | 79961 ns | 79807 ns | 1.00 |
| array/construct | 1197.35 ns | 1147.9 ns | 1.04 |
| array/random/randn/Float32 | 43009 ns | 43138 ns | 1.00 |
| array/random/randn!/Float32 | 26240 ns | 26215 ns | 1.00 |
| array/random/rand!/Int64 | 27084 ns | 27096 ns | 1.00 |
| array/random/rand!/Float32 | 8824.666666666666 ns | 8869.333333333334 ns | 0.99 |
| array/random/rand/Int64 | 29684 ns | 29884 ns | 0.99 |
| array/random/rand/Float32 | 12772 ns | 12925 ns | 0.99 |
| array/permutedims/4d | 65015 ns | 67255 ns | 0.97 |
| array/permutedims/2d | 56278 ns | 56783 ns | 0.99 |
| array/permutedims/3d | 60503.5 ns | 58969.5 ns | 1.03 |
| array/sorting/1d | 2920400.5 ns | 2933376.5 ns | 1.00 |
| array/sorting/by | 3499981 ns | 3499572.5 ns | 1.00 |
| array/sorting/2d | 1084450 ns | 1084491.5 ns | 1.00 |
| cuda/synchronization/stream/auto | 1027.8 ns | 1039.3 ns | 0.99 |
| cuda/synchronization/stream/nonblocking | 6532.2 ns | 6569.6 ns | 0.99 |
| cuda/synchronization/stream/blocking | 804.8350515463917 ns | 796.7647058823529 ns | 1.01 |
| cuda/synchronization/context/auto | 1166.8 ns | 1224.5 ns | 0.95 |
| cuda/synchronization/context/nonblocking | 6741.4 ns | 6745.4 ns | 1.00 |
| cuda/synchronization/context/blocking | 891.9583333333334 ns | 915.2391304347826 ns | 0.97 |
This comment was automatically generated by workflow using github-action-benchmark.
Force-pushed from c23ab7f to 2a2b844.
A handful of tests fail on this PR. Not a catastrophic amount, though, so probably worth looking into?

Well, not so sure about that.
MWE for the bounds error (the comparison should be `idx > length(a)`, so that the last valid index is still written):

```julia
function main()
    A = CuArray{Float64}(undef, (1, 1025, 2))
    @kernel function fill_kernel!(a)
        idx = @index(Global, Linear)
        if idx > length(a)
            # only report the first out-of-bounds index
            if idx == length(a) + 1
                @cushow threadIdx().x blockDim().x blockIdx().x gridDim().x idx
            end
        else
            a[idx] = 0f0
        end
    end
    kernel = fill_kernel!(get_backend(A))
    CUDA.@sync kernel(A; ndrange = size(A))
end
```

The linear index here goes out of bounds for a lot of threads, so I limited it to printing only the first one. The launch configuration is strange: 4 blocks of 896 threads cover 3584 items, while 3 blocks would have been sufficient, covering 2688 for an ndrange of 2050, no? In any case, it's also strange that this isn't detected by the bounds check, I presume.
This is how KA's launch configuration determines that: Regardless of the (somehow) missing bounds check here, it seems very wasteful to launch an extra block of threads. @vchuravy I'll defer to you on this.
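For reference, a minimal sketch of the arithmetic in question (a hypothetical helper, not KA's actual implementation): the minimal block count is a ceiling division of the ndrange length by the block size, so the 2050-element array above needs only 3 blocks of 896 threads, not 4.

```julia
# Hypothetical helper, not KA's actual implementation: compute the
# smallest number of blocks whose threads cover the whole ndrange.
function launch_config(ndrange::Integer, threads::Integer)
    blocks = cld(ndrange, threads)  # ceiling division
    return (threads = threads, blocks = blocks, covered = blocks * threads)
end

cfg = launch_config(2050, 896)  # length of the (1, 1025, 2) array above
# cfg.blocks == 3 and cfg.covered == 2688: three blocks already cover all
# 2050 work-items, so a fourth block (3584 threads) is pure overhead.
```

Even with the minimal configuration, 638 trailing threads fall outside the ndrange, which is why the kernel-side bounds check still matters.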
@vchuravy Any update here? I was thinking of doing the same in OpenCL, where we even have a
These failures look a lot like the failures that JuliaGPU/Metal.jl#496 fixed in Metal.jl. |
I am unsure why we couldn't have done that from the beginning.