
Conversation

huiyuxie
Contributor

I need the resize! function to run faster when it is called frequently.

I'm not sure whether 2 is a good resize factor, but it seems like a reasonable number. The benchmarks below assume that the resize length is uniformly distributed within a range.

Click for benchmark script
using CUDA
using BenchmarkTools
using Random

# We assume the resize length is uniformly distributed within 1:hi.
# Each case starts from an array at half the maximum length, and both
# implementations see the same sequence of sizes (same seed).

function bench_old!(a::CuArray, rng, hi)
    n = rand(rng, 1:hi)
    CUDA.resize!(a, n)
end

function bench_new!(a::CuArray, rng, hi)
    n = rand(rng, 1:hi)
    CUDA.new_resize!(a, n)
end

for (seed, hi) in ((1, 100), (12, 1_000), (123, 10_000), (1234, 100_000), (12345, 1_000_000))
    a = CUDA.rand(hi ÷ 2)
    println("# range 1 to $hi")
    rng = MersenneTwister(seed)
    display(@benchmark CUDA.@sync bench_old!($a, $rng, $hi))
    rng = MersenneTwister(seed)
    display(@benchmark CUDA.@sync bench_new!($a, $rng, $hi))
end
Click for benchmark results
# range 1 to 100

# old
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  400.000 ns …  2.403 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):      19.800 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):    27.475 μs ± 41.275 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▂              ██▅▂▂▂▁      ▁▅▅▂▁▁    ▄▃▂▁                   ▂
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁█████████▇██████████▇███████████▇▆▅▅▅▄▆▅▅▅▅▃▄ █
  400 ns        Histogram: log(frequency) by time      70.7 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

# new
BenchmarkTools.Trial: 10000 samples with 3 evaluations per sample.
 Range (min … max):  366.667 ns … 897.633 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):      17.900 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):    17.700 μs ±  17.207 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

         ▂      █      ▆
  ▃▁▁▁▁▁▂█▃▂▂▂▂▃█▅▃▃▂▃▄█▇▄▃▃▃▃▄▄▃▂▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  367 ns           Histogram: frequency by time         51.2 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
# range 1 to 1000

# old
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  400.000 ns …  1.352 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):      20.000 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):    28.770 μs ± 43.310 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

             ██▆▄▄▃▂▁▁▁▂▅▄▃▂▁▁▁▄▁▁ ▁▁▁                         ▂
  ▆▁▁▁▁▁▁▁▁▁▁██████████████████████████▇▆▇▆▇▆▇▆▆▆▅▅▅▅▅▆▅▄▆▁▅▃▄ █
  400 ns        Histogram: log(frequency) by time      93.4 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
# new
BenchmarkTools.Trial: 10000 samples with 6 evaluations per sample.
 Range (min … max):   3.650 μs … 324.167 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     19.800 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   22.643 μs ±  16.182 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

       ▃▂█▃▆▁▄
  ▂▃▄▆▆███████▇█▅▄▅▃▃▂▃▃▃▂▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂ ▃
  3.65 μs         Histogram: frequency by time          107 μs <

 Memory estimate: 85 bytes, allocs estimate: 4.
# range 1 to 10000

# old
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  500.000 ns …  1.365 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):      21.200 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):    28.338 μs ± 44.612 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                █▇▅▅▃▃▂▂▂▂▁▁▄▄▃▂▁▁  ▁▁▁ ▁                      ▂
  ▃▁▁▁▁▁▁▁▁▁▁▁▁▁███████████████████▇█████▇▇▇▆▆▆▅▆▅▃▃▆▃▅▅▁▃▅▃▄▅ █
  500 ns        Histogram: log(frequency) by time        81 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
# new
BenchmarkTools.Trial: 10000 samples with 3 evaluations per sample.
 Range (min … max):  366.667 ns … 551.667 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):      19.367 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):    20.171 μs ±  16.419 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

         ▃      █▁     ▆▁
  ▃▂▁▁▁▁▂█▃▂▂▂▂▃██▄▃▃▃▅██▆▄▃▃▄▆▅▄▃▃▃▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▃▂▂▂▂▂▂▂▂▂ ▃
  367 ns           Histogram: frequency by time         54.8 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
# range 1 to 100000

# old
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  16.100 μs … 771.100 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     20.300 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   26.895 μs ±  31.561 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅██▇▅▄▄▂▂▃▄▃▃▂▁▁▁ ▁                                          ▂
  ███████████████████████▇▇▇▇▇█▇▇▇▇▇▇█▇▇▆▅▅▅▅▄▆▆▄▄▅▅▄▄▃▂▄▄▃▄▂▄ █
  16.1 μs       Histogram: log(frequency) by time       105 μs <

 Memory estimate: 512 bytes, allocs estimate: 24.
# new
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  300.000 ns … 805.600 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):      17.700 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):    17.344 μs ±  28.719 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                  ▃▇▇▇▆▅▃▂▁      ▁▂▃▄▃▃▂▁                    ▂
  █▆▃▄▁▄▄▃▁▁▁▁▁▁▁▁▁▁▁████████████▇▇███████████▆▇▆▆▆▆▇███▇▇▇▇▅▅▅ █
  300 ns        Histogram: log(frequency) by time       49.9 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
# range 1 to 1000000

# old
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  11.600 μs … 236.500 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     19.000 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   21.556 μs ±  11.217 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▄▇█▇▄  ▃▇█▇▅▃▂
  ▆█████▅▄███████▇▆▄▃▃▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▃▃▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▂▁▁▁▁ ▃
  11.6 μs         Histogram: frequency by time         53.6 μs <

 Memory estimate: 512 bytes, allocs estimate: 24.
# new
BenchmarkTools.Trial: 10000 samples with 4 evaluations per sample.
 Range (min … max):  375.000 ns …  1.725 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):      15.550 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):    16.869 μs ± 23.618 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

               ▁▃▆▆▆▆▇▇▇██▇▆▅▅▄▄▂▁▁
  ▃▁▁▁▂▄▄▆▄▅▇▆█████████████████████▇▇▅▆▅▅▄▄▄▄▄▃▃▃▃▂▂▃▂▂▂▂▂▂▂▂▂ ▅
  375 ns          Histogram: frequency by time         40.1 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

But it is still not as fast as the CPU resize! function.

Do you have any other suggestions for making it run faster (e.g., a better resize factor)? And should we benchmark it under other assumptions (e.g., the resize length growing or shrinking linearly over time)? I suspect performance will be poor in some corner cases, such as repeatedly expanding the GPU array to less than double its current length.
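For example, a rough sketch of such a corner-case workload (the grow_incrementally! helper is hypothetical, and it assumes the new_resize! from this PR):

using CUDA

# The corner case mentioned above: repeatedly expand the array, each time
# to less than double its current length.
function grow_incrementally!(a::CuArray, steps::Int, step::Int)
    for _ in 1:steps
        CUDA.new_resize!(a, length(a) + step)
    end
    return a
end

a = CUDA.rand(Float32, 1_000)
CUDA.@sync grow_incrementally!(a, 1_000, 100)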

I have kept the new resize! separate from the old resize! temporarily, for easier comparison.

Contributor

github-actions bot commented Jul 30, 2025

Your PR requires formatting changes to meet the project's style guidelines.
Please consider running Runic (git runic master) to apply these changes.

Click here to view the suggested changes.
diff --git a/src/array.jl b/src/array.jl
index 5d184179c..91c59e30a 100644
--- a/src/array.jl
+++ b/src/array.jl
@@ -62,7 +62,7 @@ function valid_type(@nospecialize(T))
 end
 
 @inline function check_eltype(name, T)
-  if !valid_type(T)
+    return if !valid_type(T)
     explanation = explain_eltype(T)
     error("""
       $name only supports element types that are allocated inline.
@@ -877,7 +877,7 @@ Base.unsafe_convert(::Type{CuPtr{T}}, A::PermutedDimsArray) where {T} =
 ## resizing
 
 const RESIZE_THRESHOLD = 100 * 1024^2     # 100 MiB
-const RESIZE_INCREMENT = 32  * 1024^2     # 32  MiB
+const RESIZE_INCREMENT = 32 * 1024^2     # 32  MiB
 
 """
   resize!(a::CuVector, n::Integer)
@@ -889,60 +889,60 @@ guaranteed to be initialized.
 function Base.resize!(A::CuVector{T}, n::Integer) where T
   n == length(A) && return A
 
-  # only resize when the new length exceeds the capacity or is much smaller
-  cap = A.maxsize ÷ aligned_sizeof(T)
-  if n > cap || n < cap ÷ 4
-    len = if n < cap
-      # shrink to fit
-      n
-    elseif A.maxsize > RESIZE_THRESHOLD
-      # large arrays grown by fixed increments
-      max(n, cap + RESIZE_INCREMENT ÷ aligned_sizeof(T))
-    else
-      # small arrays are doubled in size
-      max(n, 2 * length(A))
-    end
+    # only resize when the new length exceeds the capacity or is much smaller
+    cap = A.maxsize ÷ aligned_sizeof(T)
+    if n > cap || n < cap ÷ 4
+        len = if n < cap
+            # shrink to fit
+            n
+        elseif A.maxsize > RESIZE_THRESHOLD
+            # large arrays grown by fixed increments
+            max(n, cap + RESIZE_INCREMENT ÷ aligned_sizeof(T))
+        else
+            # small arrays are doubled in size
+            max(n, 2 * length(A))
+        end
 
-    # determine the new buffer size
-    maxsize = len * aligned_sizeof(T)
-    bufsize = if isbitstype(T)
-        maxsize
-    else
-      # type tag array past the data
-      maxsize + len
-    end
+        # determine the new buffer size
+        maxsize = len * aligned_sizeof(T)
+        bufsize = if isbitstype(T)
+            maxsize
+        else
+            # type tag array past the data
+            maxsize + len
+        end
 
-    # allocate new data
-    old_data = A.data
-    new_data = context!(context(A)) do
-      mem = pool_alloc(memory_type(A), bufsize)
-      ptr = convert(CuPtr{T}, mem)
-      DataRef(pool_free, mem)
-    end
+        # allocate new data
+        old_data = A.data
+        new_data = context!(context(A)) do
+            mem = pool_alloc(memory_type(A), bufsize)
+            ptr = convert(CuPtr{T}, mem)
+            DataRef(pool_free, mem)
+        end
 
-    # replace the data with a new one. this 'unshares' the array.
-    # as a result, we can safely support resizing unowned buffers.
-    old_pointer = pointer(A)
-    old_typetagdata = typetagdata(A)
-    A.data = new_data
-    A.maxsize = maxsize
-    A.offset = 0
-    new_pointer = pointer(A)
-    new_typetagdata = typetagdata(A)
-
-    # copy existing elements and type tags
+        # replace the data with a new one. this 'unshares' the array.
+        # as a result, we can safely support resizing unowned buffers.
+        old_pointer = pointer(A)
+        old_typetagdata = typetagdata(A)
+        A.data = new_data
+        A.maxsize = maxsize
+        A.offset = 0
+        new_pointer = pointer(A)
+        new_typetagdata = typetagdata(A)
+
+        # copy existing elements and type tags
     m = min(length(A), n)
     if m > 0
-      context!(context(A)) do
-        unsafe_copyto!(new_pointer, old_pointer, m; async=true)
-        if Base.isbitsunion(T)
-          unsafe_copyto!(new_typetagdata, old_typetagdata, m; async=true)
-        end
-      end
+            context!(context(A)) do
+                unsafe_copyto!(new_pointer, old_pointer, m; async = true)
+                if Base.isbitsunion(T)
+                    unsafe_copyto!(new_typetagdata, old_typetagdata, m; async = true)
+                end
+            end
+    end
+        unsafe_free!(old_data)
     end
-    unsafe_free!(old_data)
-  end
 
   A.dims = (n,)
-  return A
+    return A
 end
diff --git a/test/base/array.jl b/test/base/array.jl
index 4c0ebca8d..eb641f2bc 100644
--- a/test/base/array.jl
+++ b/test/base/array.jl
@@ -550,41 +550,41 @@ end
 end
 
 @testset "resizing" begin
-  for data in ([1, 2, 3], [1, nothing, 3])
-    a = CuArray(data)
-    initial_capacity = a.maxsize
-    @test initial_capacity == sizeof(a)
-
-    # resizing an array should increment the capacity
-    CUDA.resize!(a, 4)
-    @test length(a) == 4
-    @test Array(a)[1:3] == data
-    resized_capacity = a.maxsize
-    @test resized_capacity > sizeof(a)
-
-    # resizing again should use the existing capacity
-    CUDA.resize!(a, 5)
+    for data in ([1, 2, 3], [1, nothing, 3])
+        a = CuArray(data)
+        initial_capacity = a.maxsize
+        @test initial_capacity == sizeof(a)
+
+        # resizing an array should increment the capacity
+        CUDA.resize!(a, 4)
+        @test length(a) == 4
+        @test Array(a)[1:3] == data
+        resized_capacity = a.maxsize
+        @test resized_capacity > sizeof(a)
+
+        # resizing again should use the existing capacity
+        CUDA.resize!(a, 5)
     @test length(a) == 5
-    @test a.maxsize == resized_capacity
-
-    # resizing significantly should trigger an exact reallocation
-    CUDA.resize!(a, 1000)
-    @test length(a) == 1000
-    @test Array(a)[1:3] == data
-    resized_capacity = a.maxsize
-    @test resized_capacity == sizeof(a)
-
-    # shrinking back down shouldn't immediately reduce capacity
-    CUDA.resize!(a, 999)
-    @test length(a) == 999
-    @test a.maxsize == resized_capacity
-
-    # shrinking significantly should trigger an exact reallocation
-    CUDA.resize!(a, 10)
-    @test length(a) == 10
-    @test Array(a)[1:3] == data
-    @test a.maxsize == sizeof(a)
-  end
+        @test a.maxsize == resized_capacity
+
+        # resizing significantly should trigger an exact reallocation
+        CUDA.resize!(a, 1000)
+        @test length(a) == 1000
+        @test Array(a)[1:3] == data
+        resized_capacity = a.maxsize
+        @test resized_capacity == sizeof(a)
+
+        # shrinking back down shouldn't immediately reduce capacity
+        CUDA.resize!(a, 999)
+        @test length(a) == 999
+        @test a.maxsize == resized_capacity
+
+        # shrinking significantly should trigger an exact reallocation
+        CUDA.resize!(a, 10)
+        @test length(a) == 10
+        @test Array(a)[1:3] == data
+        @test a.maxsize == sizeof(a)
+    end
 end
 
 @testset "aliasing" begin

@huiyuxie
Contributor Author

@maleadt Please review. Thanks!

Member

@maleadt maleadt left a comment


Can you modify resize! instead of adding a new_resize!?

The change should also make use of the maxsize property of CuArray, which is there to allow additional data to be allocated without the dimensions of the array having to match. resize! should be made aware of that, doing nothing when the requested size is smaller than maxsize.

In addition, IIUC you're using a growth factor of 2 now (ignoring resizes when shrinking by up to a half, resizing by 2 when requesting a larger array), which I think may be too aggressive for GPU arrays. For small arrays it's probably fine, but at some point (> 10MB?) we should probably use a fixed (1MB?) increment instead.
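As a rough illustration of that policy (an untested sketch; new_capacity, GROW_THRESHOLD, and GROW_INCREMENT are hypothetical names, using the placeholder values suggested above):

# Placeholder values from the suggestion above ("> 10MB?", "1MB?").
const GROW_THRESHOLD = 10 * 1024^2  # bytes
const GROW_INCREMENT = 1 * 1024^2   # bytes

# Given the current capacity (in elements), the requested length, and the
# element size, compute the capacity of the new buffer. Returning `cap`
# unchanged means "do nothing": the request already fits in maxsize.
function new_capacity(cap::Int, n::Int, elsize::Int)
    n <= cap && return cap
    if cap * elsize > GROW_THRESHOLD
        return max(n, cap + GROW_INCREMENT ÷ elsize)  # large arrays: fixed increment
    else
        return max(n, 2 * cap)                        # small arrays: double
    end
end

new_capacity(1_000, 1_001, sizeof(Float32))  # == 2_000: a small array doubles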

In addition,

@huiyuxie
Contributor Author

huiyuxie commented Aug 5, 2025

The change should also make use of the maxsize property of CuArray, which is there to allow additional data to be allocated without the dimensions of the array having to match.

👍

which I think may be too aggressive for GPU arrays. For small arrays it's probably fine, but at some point (> 10MB?) we should probably use a fixed (1MB?) increment instead.

👍 but why do we choose 10MB and 1MB?

In addition,

Do you have any other comments?

@huiyuxie huiyuxie requested a review from maleadt August 5, 2025 18:53

codecov bot commented Aug 5, 2025

Codecov Report

❌ Patch coverage is 96.77419% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 89.32%. Comparing base (f7deec6) to head (f477f48).
⚠️ Report is 5 commits behind head on master.

Files with missing lines   Patch %   Lines
src/array.jl               96.77%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2828      +/-   ##
==========================================
- Coverage   89.45%   89.32%   -0.14%     
==========================================
  Files         150      150              
  Lines       13078    13093      +15     
==========================================
- Hits        11699    11695       -4     
- Misses       1379     1398      +19     

☔ View full report in Codecov by Sentry.

@huiyuxie
Contributor Author

huiyuxie commented Aug 6, 2025

Can you point me to the existing CI benchmark results, @maleadt? I could not find them. Thanks!

@huiyuxie
Contributor Author

Please review @maleadt. Thanks!

@maleadt
Member

maleadt commented Oct 15, 2025

Sorry for the delay. I simplified the tests a little and added support for shrinking, as well as isbits-union arrays (although untested).
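A minimal smoke test for the isbits-union path could look like this (an untested sketch, mirroring the resizing testset shown earlier):

using CUDA, Test

a = CuArray([1, nothing, 3])  # isbits-union eltype Union{Int64, Nothing}
CUDA.resize!(a, 5)            # growing must copy the type tags along with the data
@test length(a) == 5
@test Array(a)[1:3] == [1, nothing, 3]
CUDA.resize!(a, 2)            # shrinking within capacity only updates the dims
@test Array(a) == [1, nothing]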

but why do we choose 10MB and 1MB?

That was arbitrary. Would probably be useful to see what other projects do.

Contributor

@github-actions github-actions bot left a comment


CUDA.jl Benchmarks

Benchmark suite Current: 17565a4 Previous: 2130acf Ratio
latency/precompile 56738141787 ns 56500883334.5 ns 1.00
latency/ttfp 8429462389 ns 8365271392 ns 1.01
latency/import 4514032627 ns 4507842458 ns 1.00
integration/volumerhs 9628219 ns 9611007.5 ns 1.00
integration/byval/slices=1 147198 ns 146935 ns 1.00
integration/byval/slices=3 426477 ns 425946 ns 1.00
integration/byval/reference 145011 ns 145067 ns 1.00
integration/byval/slices=2 286716 ns 286530 ns 1.00
integration/cudadevrt 103677 ns 103628 ns 1.00
kernel/indexing 14292 ns 14200 ns 1.01
kernel/indexing_checked 14954 ns 15046.5 ns 0.99
kernel/occupancy 707.3241379310344 ns 677.6815286624204 ns 1.04
kernel/launch 2156.1111111111113 ns 2183.1111111111113 ns 0.99
kernel/rand 18723 ns 14941 ns 1.25
array/reverse/1d 20126 ns 20250 ns 0.99
array/reverse/2dL_inplace 66949 ns 67030 ns 1.00
array/reverse/1dL 70325 ns 70487 ns 1.00
array/reverse/2d 22042 ns 22100 ns 1.00
array/reverse/1d_inplace 9673 ns 9646 ns 1.00
array/reverse/2d_inplace 13448 ns 13444 ns 1.00
array/reverse/2dL 74074 ns 74138 ns 1.00
array/reverse/1dL_inplace 66854 ns 66810 ns 1.00
array/copy 20997 ns 20566 ns 1.02
array/iteration/findall/int 158461 ns 158051 ns 1.00
array/iteration/findall/bool 140371 ns 140105.5 ns 1.00
array/iteration/findfirst/int 161326.5 ns 161684.5 ns 1.00
array/iteration/findfirst/bool 162046 ns 162377 ns 1.00
array/iteration/scalar 72273 ns 73398 ns 0.98
array/iteration/logical 217476 ns 216289 ns 1.01
array/iteration/findmin/1d 51081 ns 50985 ns 1.00
array/iteration/findmin/2d 97113 ns 96912 ns 1.00
array/reductions/reduce/Int64/1d 43245 ns 43765 ns 0.99
array/reductions/reduce/Int64/dims=1 44805 ns 45043 ns 0.99
array/reductions/reduce/Int64/dims=2 61576 ns 61514 ns 1.00
array/reductions/reduce/Int64/dims=1L 89140 ns 89178.5 ns 1.00
array/reductions/reduce/Int64/dims=2L 88283 ns 88181 ns 1.00
array/reductions/reduce/Float32/1d 37404 ns 37076.5 ns 1.01
array/reductions/reduce/Float32/dims=1 41817 ns 41802.5 ns 1.00
array/reductions/reduce/Float32/dims=2 60022 ns 59975 ns 1.00
array/reductions/reduce/Float32/dims=1L 52719 ns 52588 ns 1.00
array/reductions/reduce/Float32/dims=2L 72606 ns 72470 ns 1.00
array/reductions/mapreduce/Int64/1d 43496 ns 43472 ns 1.00
array/reductions/mapreduce/Int64/dims=1 55559.5 ns 44367 ns 1.25
array/reductions/mapreduce/Int64/dims=2 61790 ns 61825 ns 1.00
array/reductions/mapreduce/Int64/dims=1L 89096 ns 89416 ns 1.00
array/reductions/mapreduce/Int64/dims=2L 88662 ns 88547 ns 1.00
array/reductions/mapreduce/Float32/1d 38506 ns 37303 ns 1.03
array/reductions/mapreduce/Float32/dims=1 42206.5 ns 46174.5 ns 0.91
array/reductions/mapreduce/Float32/dims=2 60340 ns 60053 ns 1.00
array/reductions/mapreduce/Float32/dims=1L 53001 ns 52864.5 ns 1.00
array/reductions/mapreduce/Float32/dims=2L 72556 ns 72451.5 ns 1.00
array/broadcast 20671 ns 20237.5 ns 1.02
array/copyto!/gpu_to_gpu 11734 ns 13049 ns 0.90
array/copyto!/cpu_to_gpu 214209 ns 215931 ns 0.99
array/copyto!/gpu_to_cpu 284311 ns 284076.5 ns 1.00
array/accumulate/Int64/1d 124765 ns 124697 ns 1.00
array/accumulate/Int64/dims=1 83540 ns 83423 ns 1.00
array/accumulate/Int64/dims=2 157883 ns 157799 ns 1.00
array/accumulate/Int64/dims=1L 1709761 ns 1710025 ns 1.00
array/accumulate/Int64/dims=2L 966484 ns 966307 ns 1.00
array/accumulate/Float32/1d 109093 ns 109616 ns 1.00
array/accumulate/Float32/dims=1 80727 ns 80549 ns 1.00
array/accumulate/Float32/dims=2 147490.5 ns 147930 ns 1.00
array/accumulate/Float32/dims=1L 1618951.5 ns 1618991 ns 1.00
array/accumulate/Float32/dims=2L 698350 ns 698663 ns 1.00
array/construct 1258.8 ns 1280.7 ns 0.98
array/random/randn/Float32 44815 ns 44974 ns 1.00
array/random/randn!/Float32 24777 ns 25380 ns 0.98
array/random/rand!/Int64 27309 ns 27271 ns 1.00
array/random/rand!/Float32 8916.333333333334 ns 9033.666666666666 ns 0.99
array/random/rand/Int64 29707 ns 29955 ns 0.99
array/random/rand/Float32 13012 ns 13512 ns 0.96
array/permutedims/4d 60302 ns 60442 ns 1.00
array/permutedims/2d 54019 ns 54237.5 ns 1.00
array/permutedims/3d 55029.5 ns 55099 ns 1.00
array/sorting/1d 2757377 ns 2757475 ns 1.00
array/sorting/by 3368301.5 ns 3344486 ns 1.01
array/sorting/2d 1088432.5 ns 1080698 ns 1.01
cuda/synchronization/stream/auto 1065.1 ns 1042.3 ns 1.02
cuda/synchronization/stream/nonblocking 8403 ns 7397.299999999999 ns 1.14
cuda/synchronization/stream/blocking 836.468085106383 ns 814.7333333333333 ns 1.03
cuda/synchronization/context/auto 1202.5 ns 1171.5 ns 1.03
cuda/synchronization/context/nonblocking 7957.9 ns 8941.8 ns 0.89
cuda/synchronization/context/blocking 923.0076923076923 ns 908.7894736842105 ns 1.02

This comment was automatically generated by workflow using github-action-benchmark.

@maleadt
Member

maleadt commented Oct 15, 2025

CI failure related.

@maleadt maleadt merged commit c4598ed into JuliaGPU:master Oct 16, 2025
2 of 3 checks passed