cuTENSOR: Preserve storage type when multiplying #2775

Merged
merged 1 commit into JuliaGPU:master on May 9, 2025

Conversation

christiangnrd (Member)

This is the only instance I could find in cuTENSOR where the library, rather than the caller, determines the output storage type.

I would appreciate a second look, however.
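
For readers following along, here is a minimal sketch of the idea behind the fix (not the actual diff; `output_like` is a hypothetical helper): when the library allocates the contraction output itself, it should reuse the input's memory type parameter instead of falling back to the default device memory.

```julia
using CUDA

# Hypothetical helper illustrating the intent of this PR: allocate the
# output with the same memory type `M` as the input. `M` is the third
# type parameter of CuArray (e.g. CUDA.DeviceMemory, CUDA.UnifiedMemory).
function output_like(A::CuArray{T,N,M}, S::Type, dims::Dims) where {T,N,M}
    CuArray{S,length(dims),M}(undef, dims)
end
```

With this, an input backed by `CUDA.UnifiedMemory` produces a unified-memory output, rather than silently switching to device memory.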

maleadt (Member) commented May 8, 2025

Multiplying tensors is pretty common, so it's likely that this is the case @OliverDudgeon ran into.

I do wonder if we should add a buffer typevar to the CuTensor type. It's not needed for functionality (we can just take it from the contained array), but it would make the CuArray field fully typed. Maybe that's not worth it, given the heavyweight nature of the operations applied to CuTensor objects.
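
A rough sketch of what that suggestion could look like, assuming the current two-parameter definition (field names are illustrative and may not match the real cuTENSOR.jl source):

```julia
# Sketch: parameterize CuTensor on the memory type M so that the `data`
# field is concretely typed. M adds no functionality (it can always be
# recovered from the contained array), but avoids an abstractly-typed field.
mutable struct CuTensor{T,N,M}
    data::CuArray{T,N,M}  # fully typed, so field accesses infer concretely
    inds::Vector{Char}    # tensor mode labels
end
```

The trade-off noted above: each cuTENSOR call on a CuTensor is heavyweight enough that any dynamic dispatch saved by the extra typevar is probably negligible.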

@maleadt added labels: enhancement (New feature or request), cuda libraries (Stuff about CUDA library wrappers) on May 8, 2025
@maleadt changed the title from "cuTENSOR storage type fix" to "cuTENSOR: Preserve storage type when multiplying" on May 8, 2025
@github-actions bot (Contributor) left a comment:

CUDA.jl Benchmarks

| Benchmark suite | Current: ebab590 | Previous: 82c2074 | Ratio |
|---|---|---|---|
| latency/precompile | 42980450862.5 ns | 42798778214.5 ns | 1.00 |
| latency/ttfp | 7130974552 ns | 7189648330 ns | 0.99 |
| latency/import | 3422341012 ns | 3448929760 ns | 0.99 |
| integration/volumerhs | 9605727 ns | 9608526 ns | 1.00 |
| integration/byval/slices=1 | 146901 ns | 147048 ns | 1.00 |
| integration/byval/slices=3 | 425496 ns | 425659 ns | 1.00 |
| integration/byval/reference | 145145 ns | 145118 ns | 1.00 |
| integration/byval/slices=2 | 286258 ns | 286478.5 ns | 1.00 |
| integration/cudadevrt | 103406 ns | 103554 ns | 1.00 |
| kernel/indexing | 14335 ns | 14396 ns | 1.00 |
| kernel/indexing_checked | 15224 ns | 15267 ns | 1.00 |
| kernel/occupancy | 717.7214285714285 ns | 705.917808219178 ns | 1.02 |
| kernel/launch | 2319.8888888888887 ns | 2478.3333333333335 ns | 0.94 |
| kernel/rand | 17485 ns | 14849 ns | 1.18 |
| array/reverse/1d | 19940 ns | 19642 ns | 1.02 |
| array/reverse/2d | 24054.5 ns | 25359 ns | 0.95 |
| array/reverse/1d_inplace | 10603 ns | 11514 ns | 0.92 |
| array/reverse/2d_inplace | 12165 ns | 12988 ns | 0.94 |
| array/copy | 21558 ns | 21283 ns | 1.01 |
| array/iteration/findall/int | 159770 ns | 158862.5 ns | 1.01 |
| array/iteration/findall/bool | 139706 ns | 139368 ns | 1.00 |
| array/iteration/findfirst/int | 164573 ns | 162842 ns | 1.01 |
| array/iteration/findfirst/bool | 165091.5 ns | 164699.5 ns | 1.00 |
| array/iteration/scalar | 74481.5 ns | 72904 ns | 1.02 |
| array/iteration/logical | 218609 ns | 218588 ns | 1.00 |
| array/iteration/findmin/1d | 48018 ns | 48297 ns | 0.99 |
| array/iteration/findmin/2d | 99218.5 ns | 98436 ns | 1.01 |
| array/reductions/reduce/1d | 36200 ns | 43805.5 ns | 0.83 |
| array/reductions/reduce/2d | 42507 ns | 52620 ns | 0.81 |
| array/reductions/mapreduce/1d | 34419 ns | 40094.5 ns | 0.86 |
| array/reductions/mapreduce/2d | 41763.5 ns | 51319 ns | 0.81 |
| array/broadcast | 21106 ns | 21139 ns | 1.00 |
| array/copyto!/gpu_to_gpu | 12937 ns | 11014 ns | 1.17 |
| array/copyto!/cpu_to_gpu | 219200 ns | 216920 ns | 1.01 |
| array/copyto!/gpu_to_cpu | 285143 ns | 286440.5 ns | 1.00 |
| array/accumulate/1d | 109706 ns | 110134 ns | 1.00 |
| array/accumulate/2d | 80932 ns | 81297 ns | 1.00 |
| array/construct | 1266.9 ns | 1331.6999999999998 ns | 0.95 |
| array/random/randn/Float32 | 48153.5 ns | 44531.5 ns | 1.08 |
| array/random/randn!/Float32 | 25132 ns | 25102 ns | 1.00 |
| array/random/rand!/Int64 | 27336 ns | 27335 ns | 1.00 |
| array/random/rand!/Float32 | 8755 ns | 8881.833333333332 ns | 0.99 |
| array/random/rand/Int64 | 34510 ns | 30496 ns | 1.13 |
| array/random/rand/Float32 | 13273 ns | 13415 ns | 0.99 |
| array/permutedims/4d | 61469 ns | 61709 ns | 1.00 |
| array/permutedims/2d | 55555 ns | 55698 ns | 1.00 |
| array/permutedims/3d | 56369 ns | 56496 ns | 1.00 |
| array/sorting/1d | 2778468 ns | 2777538 ns | 1.00 |
| array/sorting/by | 3370277 ns | 3368839 ns | 1.00 |
| array/sorting/2d | 1086517.5 ns | 1086273.5 ns | 1.00 |
| cuda/synchronization/stream/auto | 1037.4545454545455 ns | 1004.5714285714286 ns | 1.03 |
| cuda/synchronization/stream/nonblocking | 8026.6 ns | 8095 ns | 0.99 |
| cuda/synchronization/stream/blocking | 845.3316326530612 ns | 844.8709677419355 ns | 1.00 |
| cuda/synchronization/context/auto | 1187.1 ns | 1158 ns | 1.03 |
| cuda/synchronization/context/nonblocking | 8035 ns | 7052.9 ns | 1.14 |
| cuda/synchronization/context/blocking | 908.8333333333334 ns | 884.25 ns | 1.03 |

This comment was automatically generated by a workflow using github-action-benchmark.


codecov bot commented May 9, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89.77%. Comparing base (71bc923) to head (ebab590).
Report is 1 commit behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2775      +/-   ##
==========================================
+ Coverage   89.72%   89.77%   +0.05%     
==========================================
  Files         153      153              
  Lines       13228    13228              
==========================================
+ Hits        11869    11876       +7     
+ Misses       1359     1352       -7     

☔ View full report in Codecov by Sentry.

@maleadt maleadt merged commit bb8259f into JuliaGPU:master May 9, 2025
3 checks passed
@christiangnrd christiangnrd deleted the storage branch May 9, 2025 11:13