Conversation
Will have to wait for #2593 to get merged.
CUDA.jl Benchmarks
| Benchmark suite | Current: c23ab7f | Previous: 14ae82d | Ratio |
|---|---|---|---|
| latency/precompile | 45424277593.5 ns | 45345622059 ns | 1.00 |
| latency/ttfp | 6384836106.5 ns | 6434638936 ns | 0.99 |
| latency/import | 3034614553 ns | 3051828695.5 ns | 0.99 |
| integration/volumerhs | 9567744 ns | 9568259 ns | 1.00 |
| integration/byval/slices=1 | 146777 ns | 146590 ns | 1.00 |
| integration/byval/slices=3 | 425370 ns | 425823 ns | 1.00 |
| integration/byval/reference | 144803 ns | 144766 ns | 1.00 |
| integration/byval/slices=2 | 286177 ns | 286423 ns | 1.00 |
| integration/cudadevrt | 103434 ns | 103488.5 ns | 1.00 |
| kernel/indexing | 13938 ns | 14282.5 ns | 0.98 |
| kernel/indexing_checked | 15060 ns | 15333 ns | 0.98 |
| kernel/occupancy | 701.1063829787234 ns | 720.4492753623189 ns | 0.97 |
| kernel/launch | 2124.1111111111113 ns | 2130.5 ns | 1.00 |
| kernel/rand | 16334 ns | 17397 ns | 0.94 |
| array/reverse/1d | 19520 ns | 19471 ns | 1.00 |
| array/reverse/2d | 24603 ns | 24536 ns | 1.00 |
| array/reverse/1d_inplace | 10031.666666666666 ns | 10836.333333333334 ns | 0.93 |
| array/reverse/2d_inplace | 11528 ns | 11284 ns | 1.02 |
| array/copy | 20270 ns | 20310 ns | 1.00 |
| array/iteration/findall/int | 159097 ns | 158042 ns | 1.01 |
| array/iteration/findall/bool | 139369 ns | 138224 ns | 1.01 |
| array/iteration/findfirst/int | 153853 ns | 154038.5 ns | 1.00 |
| array/iteration/findfirst/bool | 154627.5 ns | 155126 ns | 1.00 |
| array/iteration/scalar | 75657 ns | 76714 ns | 0.99 |
| array/iteration/logical | 207799 ns | 214056.5 ns | 0.97 |
| array/iteration/findmin/1d | 41128 ns | 41628 ns | 0.99 |
| array/iteration/findmin/2d | 94766 ns | 94463 ns | 1.00 |
| array/reductions/reduce/1d | 38659 ns | 51305 ns | 0.75 |
| array/reductions/reduce/2d | 44155.5 ns | 42302 ns | 1.04 |
| array/reductions/mapreduce/1d | 37246.5 ns | 44898.5 ns | 0.83 |
| array/reductions/mapreduce/2d | 51913.5 ns | 52966.5 ns | 0.98 |
| array/broadcast | 21698 ns | 21607 ns | 1.00 |
| array/copyto!/gpu_to_gpu | 11663 ns | 13399 ns | 0.87 |
| array/copyto!/cpu_to_gpu | 213248 ns | 213579.5 ns | 1.00 |
| array/copyto!/gpu_to_cpu | 246883 ns | 245985.5 ns | 1.00 |
| array/accumulate/1d | 108538 ns | 109003 ns | 1.00 |
| array/accumulate/2d | 79961 ns | 79807 ns | 1.00 |
| array/construct | 1197.35 ns | 1147.9 ns | 1.04 |
| array/random/randn/Float32 | 43009 ns | 43138 ns | 1.00 |
| array/random/randn!/Float32 | 26240 ns | 26215 ns | 1.00 |
| array/random/rand!/Int64 | 27084 ns | 27096 ns | 1.00 |
| array/random/rand!/Float32 | 8824.666666666666 ns | 8869.333333333334 ns | 0.99 |
| array/random/rand/Int64 | 29684 ns | 29884 ns | 0.99 |
| array/random/rand/Float32 | 12772 ns | 12925 ns | 0.99 |
| array/permutedims/4d | 65015 ns | 67255 ns | 0.97 |
| array/permutedims/2d | 56278 ns | 56783 ns | 0.99 |
| array/permutedims/3d | 60503.5 ns | 58969.5 ns | 1.03 |
| array/sorting/1d | 2920400.5 ns | 2933376.5 ns | 1.00 |
| array/sorting/by | 3499981 ns | 3499572.5 ns | 1.00 |
| array/sorting/2d | 1084450 ns | 1084491.5 ns | 1.00 |
| cuda/synchronization/stream/auto | 1027.8 ns | 1039.3 ns | 0.99 |
| cuda/synchronization/stream/nonblocking | 6532.2 ns | 6569.6 ns | 0.99 |
| cuda/synchronization/stream/blocking | 804.8350515463917 ns | 796.7647058823529 ns | 1.01 |
| cuda/synchronization/context/auto | 1166.8 ns | 1224.5 ns | 0.95 |
| cuda/synchronization/context/nonblocking | 6741.4 ns | 6745.4 ns | 1.00 |
| cuda/synchronization/context/blocking | 891.9583333333334 ns | 915.2391304347826 ns | 0.97 |
This comment was automatically generated by workflow using github-action-benchmark.
Force-pushed from c23ab7f to 2a2b844.
A handful of tests fail on this PR. Not a catastrophic amount, though, so probably worth looking into?

Well, not so sure about that.
MWE for the bounds error (the comparison should be `idx > length(a)`, so that the last valid index is still written):

```julia
function main()
    A = CuArray{Float64}(undef, (1, 1025, 2))
    @kernel function fill_kernel!(a)
        idx = @index(Global, Linear)
        if idx > length(a)
            # only report the first out-of-bounds index
            if idx == length(a) + 1
                @cushow threadIdx().x blockDim().x blockIdx().x gridDim().x idx
            end
        else
            a[idx] = 0f0
        end
    end
    kernel = fill_kernel!(get_backend(A))
    CUDA.@sync kernel(A; ndrange = size(A))
end
```

The linear index here goes out of bounds for a lot of threads, so I limited it to printing only the first one. The launch configuration is strange: 4 blocks of 896 threads cover 3584 items, while 3 blocks would have been sufficient, covering 2688 for an ndrange of 2050, no? In any case, it's also strange that this isn't detected by the bounds check, I presume.
This is how KA's launch configuration determines that: Regardless of the (somehow) missing bounds check here, it seems very wasteful to launch an extra block of threads. @vchuravy I'll defer to you on this.
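For reference, a minimal sketch of the arithmetic in question (a hypothetical helper, not KA's actual implementation): the minimal block count is a ceiling division of the ndrange length by the block size, so the 2050-element array above needs only 3 blocks of 896 threads, not 4.

```julia
# Hypothetical helper, not KA's actual implementation: compute the
# smallest number of blocks whose threads cover the whole ndrange.
function launch_config(ndrange::Integer, threads::Integer)
    blocks = cld(ndrange, threads)  # ceiling division
    return (threads = threads, blocks = blocks, covered = blocks * threads)
end

cfg = launch_config(2050, 896)  # length of the (1, 1025, 2) array above
# cfg.blocks == 3 and cfg.covered == 2688: three blocks already cover all
# 2050 work-items, so a fourth block (3584 threads) is pure overhead.
```

Even with the minimal configuration, 638 trailing threads fall outside the ndrange, which is why the kernel-side bounds check still matters.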
@vchuravy Any update here? I was thinking of doing the same in OpenCL, where we even have a
These failures look a lot like the failures that JuliaGPU/Metal.jl#496 fixed in Metal.jl. |
I am unsure why we couldn't have done that from the beginning.