cuTENSOR: Preserve storage type when multiplying #2775
Conversation
Multiplying tensors is pretty common, so it's likely that this is the case @OliverDudgeon ran into. I do wonder if we should add a buffer typevar to the …
CUDA.jl Benchmarks
| Benchmark suite | Current: ebab590 | Previous: 82c2074 | Ratio |
|---|---|---|---|
| latency/precompile | 42980450862.5 ns | 42798778214.5 ns | 1.00 |
| latency/ttfp | 7130974552 ns | 7189648330 ns | 0.99 |
| latency/import | 3422341012 ns | 3448929760 ns | 0.99 |
| integration/volumerhs | 9605727 ns | 9608526 ns | 1.00 |
| integration/byval/slices=1 | 146901 ns | 147048 ns | 1.00 |
| integration/byval/slices=3 | 425496 ns | 425659 ns | 1.00 |
| integration/byval/reference | 145145 ns | 145118 ns | 1.00 |
| integration/byval/slices=2 | 286258 ns | 286478.5 ns | 1.00 |
| integration/cudadevrt | 103406 ns | 103554 ns | 1.00 |
| kernel/indexing | 14335 ns | 14396 ns | 1.00 |
| kernel/indexing_checked | 15224 ns | 15267 ns | 1.00 |
| kernel/occupancy | 717.72 ns | 705.92 ns | 1.02 |
| kernel/launch | 2319.89 ns | 2478.33 ns | 0.94 |
| kernel/rand | 17485 ns | 14849 ns | 1.18 |
| array/reverse/1d | 19940 ns | 19642 ns | 1.02 |
| array/reverse/2d | 24054.5 ns | 25359 ns | 0.95 |
| array/reverse/1d_inplace | 10603 ns | 11514 ns | 0.92 |
| array/reverse/2d_inplace | 12165 ns | 12988 ns | 0.94 |
| array/copy | 21558 ns | 21283 ns | 1.01 |
| array/iteration/findall/int | 159770 ns | 158862.5 ns | 1.01 |
| array/iteration/findall/bool | 139706 ns | 139368 ns | 1.00 |
| array/iteration/findfirst/int | 164573 ns | 162842 ns | 1.01 |
| array/iteration/findfirst/bool | 165091.5 ns | 164699.5 ns | 1.00 |
| array/iteration/scalar | 74481.5 ns | 72904 ns | 1.02 |
| array/iteration/logical | 218609 ns | 218588 ns | 1.00 |
| array/iteration/findmin/1d | 48018 ns | 48297 ns | 0.99 |
| array/iteration/findmin/2d | 99218.5 ns | 98436 ns | 1.01 |
| array/reductions/reduce/1d | 36200 ns | 43805.5 ns | 0.83 |
| array/reductions/reduce/2d | 42507 ns | 52620 ns | 0.81 |
| array/reductions/mapreduce/1d | 34419 ns | 40094.5 ns | 0.86 |
| array/reductions/mapreduce/2d | 41763.5 ns | 51319 ns | 0.81 |
| array/broadcast | 21106 ns | 21139 ns | 1.00 |
| array/copyto!/gpu_to_gpu | 12937 ns | 11014 ns | 1.17 |
| array/copyto!/cpu_to_gpu | 219200 ns | 216920 ns | 1.01 |
| array/copyto!/gpu_to_cpu | 285143 ns | 286440.5 ns | 1.00 |
| array/accumulate/1d | 109706 ns | 110134 ns | 1.00 |
| array/accumulate/2d | 80932 ns | 81297 ns | 1.00 |
| array/construct | 1266.9 ns | 1331.7 ns | 0.95 |
| array/random/randn/Float32 | 48153.5 ns | 44531.5 ns | 1.08 |
| array/random/randn!/Float32 | 25132 ns | 25102 ns | 1.00 |
| array/random/rand!/Int64 | 27336 ns | 27335 ns | 1.00 |
| array/random/rand!/Float32 | 8755 ns | 8881.83 ns | 0.99 |
| array/random/rand/Int64 | 34510 ns | 30496 ns | 1.13 |
| array/random/rand/Float32 | 13273 ns | 13415 ns | 0.99 |
| array/permutedims/4d | 61469 ns | 61709 ns | 1.00 |
| array/permutedims/2d | 55555 ns | 55698 ns | 1.00 |
| array/permutedims/3d | 56369 ns | 56496 ns | 1.00 |
| array/sorting/1d | 2778468 ns | 2777538 ns | 1.00 |
| array/sorting/by | 3370277 ns | 3368839 ns | 1.00 |
| array/sorting/2d | 1086517.5 ns | 1086273.5 ns | 1.00 |
| cuda/synchronization/stream/auto | 1037.45 ns | 1004.57 ns | 1.03 |
| cuda/synchronization/stream/nonblocking | 8026.6 ns | 8095 ns | 0.99 |
| cuda/synchronization/stream/blocking | 845.33 ns | 844.87 ns | 1.00 |
| cuda/synchronization/context/auto | 1187.1 ns | 1158 ns | 1.03 |
| cuda/synchronization/context/nonblocking | 8035 ns | 7052.9 ns | 1.14 |
| cuda/synchronization/context/blocking | 908.83 ns | 884.25 ns | 1.03 |
This comment was automatically generated by a workflow using github-action-benchmark.
Codecov Report: all modified and coverable lines are covered by tests ✅

```
@@            Coverage Diff             @@
##           master    #2775      +/-   ##
==========================================
+ Coverage   89.72%   89.77%   +0.05%
==========================================
  Files         153      153
  Lines       13228    13228
==========================================
+ Hits        11869    11876       +7
+ Misses       1359     1352       -7
```

View full report in Codecov by Sentry.
This is the only instance I could find in cuTENSOR where the responsibility for the output storage type falls on the library.
I would appreciate a second look, however.
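To illustrate the idea behind preserving the storage type, here is a minimal, self-contained Julia sketch. It is not the actual cuTENSOR code: the `MyArray` type, its `B` buffer typevar, and `mul_preserving` are hypothetical stand-ins that mimic the common CUDA.jl pattern of carrying the buffer type as an array type parameter. The point is that allocating the output with `similar` on an input propagates that parameter, whereas hard-coding a default output type would silently discard it.

```julia
# Hypothetical stand-ins for buffer types (e.g. device vs. unified memory).
struct DeviceBuffer end
struct UnifiedBuffer end

# A toy array type carrying a buffer typevar B, like CuArray{T,N,B}.
struct MyArray{T,N,B} <: AbstractArray{T,N}
    data::Array{T,N}
end
Base.size(A::MyArray) = size(A.data)
Base.getindex(A::MyArray, i...) = A.data[i...]
Base.setindex!(A::MyArray, v, i...) = (A.data[i...] = v)

# `similar` keeps the buffer typevar B from the input array.
Base.similar(A::MyArray{T,N,B}, ::Type{S}, dims::Dims) where {T,N,B,S} =
    MyArray{S,length(dims),B}(Array{S}(undef, dims))

# Allocating the output via `similar(A, ...)` means multiplying two
# unified-memory arrays yields a unified-memory result, rather than
# falling back to a hard-coded default storage type.
function mul_preserving(A::MyArray{<:Any,2}, B::MyArray{<:Any,2})
    C = similar(A, promote_type(eltype(A), eltype(B)), (size(A, 1), size(B, 2)))
    C.data .= A.data * B.data
    return C
end

A = MyArray{Float32,2,UnifiedBuffer}(rand(Float32, 4, 4))
B = MyArray{Float32,2,UnifiedBuffer}(rand(Float32, 4, 4))
C = mul_preserving(A, B)
@assert C isa MyArray{Float32,2,UnifiedBuffer}  # buffer type preserved
```

The same design question applies to the buffer-typevar suggestion above: once the output is derived from an input with `similar`, the library no longer has to decide the storage type itself.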