@rluvaton rluvaton commented Oct 19, 2025

Which issue does this PR close?

N/A

Rationale for this change

Calling `OffsetBuffer::from_lengths(std::iter::repeat_n(size, value.len()))` does not utilize SIMD (I can explain further if you want).
See GodBolt Link
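
To illustrate the rationale, here is a minimal standalone sketch, not the arrow-rs code. It assumes `i32` offsets, and the function names `offsets_from_lengths` and `offsets_from_repeated_length` are hypothetical. Building offsets from an iterator of lengths is a running sum, where each element depends on the previous one. When every length is equal, each offset can instead be computed independently as `i * length`, which gives the compiler a better chance to vectorize the loop.

```rust
/// Running-sum construction: each offset depends on the previous one,
/// so the loop carries a dependency from iteration to iteration.
fn offsets_from_lengths(lengths: impl IntoIterator<Item = usize>) -> Vec<i32> {
    let mut offsets = vec![0i32];
    let mut acc = 0usize;
    for len in lengths {
        acc = acc.checked_add(len).expect("offset overflow");
        offsets.push(i32::try_from(acc).expect("offset overflow"));
    }
    offsets
}

/// Closed-form construction: offset `i` is `i * length`, so every
/// element can be computed independently of the others.
fn offsets_from_repeated_length(length: usize, n: usize) -> Vec<i32> {
    // Validate the largest offset once, up front, instead of per element.
    let last = length.checked_mul(n).expect("offset overflow");
    i32::try_from(last).expect("offset overflow");
    // Every product below is <= `last`, so the cast cannot truncate.
    (0..=n).map(|i| (i * length) as i32).collect()
}

fn main() {
    // `repeat(4).take(3)` yields the same items as `repeat_n(4, 3)`.
    let from_iter = offsets_from_lengths(std::iter::repeat(4).take(3));
    let closed_form = offsets_from_repeated_length(4, 3);
    assert_eq!(from_iter, closed_form);
    println!("{closed_form:?}");
}
```

Both functions return `n + 1` offsets starting at zero; only the second avoids the per-element dependency and per-element overflow check.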

Extracted from:

Once this and the PR below are merged, DataFusion's scalar-to-array conversion can be updated to use this, making it really, really fast:

What changes are included in this PR?

Added a new function, `OffsetBuffer::from_repeated_length`.

Are these changes tested?

yes

Are there any user-facing changes?

yes

@github-actions github-actions bot added the arrow Changes to the arrow crate label Oct 19, 2025
@alamb alamb left a comment

Looks great to me -- thank you @rluvaton

alamb commented Oct 20, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1017-gcp #18~24.04.1-Ubuntu SMP Tue Sep 23 17:51:44 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing add-from-length-repeated-for-offset-buffer (3317a39) to 4fc9302 diff
BENCH_NAME=cast_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench cast_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=add-from-length-repeated-for-offset-buffer
Results will be posted here when complete

alamb commented Oct 20, 2025

🤖: Benchmark completed

Details

group                                                              add-from-length-repeated-for-offset-buffer    main
-----                                                              ------------------------------------------    ----
cast binary view to string                                         1.01     73.3±0.30µs        ? ?/sec           1.00     72.5±0.16µs        ? ?/sec
cast binary view to string view                                    1.04    105.4±0.24µs        ? ?/sec           1.00    101.6±0.25µs        ? ?/sec
cast binary view to wide string                                    1.00     70.2±0.23µs        ? ?/sec           1.00     70.1±0.24µs        ? ?/sec
cast date32 to date64 512                                          1.02    299.6±2.42ns        ? ?/sec           1.00    293.6±0.31ns        ? ?/sec
cast date64 to date32 512                                          1.00    506.9±2.67ns        ? ?/sec           1.00    504.8±0.51ns        ? ?/sec
cast decimal128 to decimal128 512                                  1.00    612.5±0.49ns        ? ?/sec           1.00    612.7±0.57ns        ? ?/sec
cast decimal128 to decimal128 512 lower precision                  1.00      5.3±0.02µs        ? ?/sec           1.01      5.3±0.02µs        ? ?/sec
cast decimal128 to decimal128 512 with lower scale (infallible)    1.00      6.5±0.01µs        ? ?/sec           1.00      6.5±0.01µs        ? ?/sec
cast decimal128 to decimal128 512 with same scale                  1.00     75.7±0.74ns        ? ?/sec           1.00     75.9±0.11ns        ? ?/sec
cast decimal128 to decimal256 512                                  1.00      2.5±0.00µs        ? ?/sec           1.00      2.5±0.01µs        ? ?/sec
cast decimal256 to decimal128 512                                  1.00     48.8±0.11µs        ? ?/sec           1.02     49.6±0.15µs        ? ?/sec
cast decimal256 to decimal256 512                                  1.00     10.9±0.03µs        ? ?/sec           1.00     10.9±0.03µs        ? ?/sec
cast decimal256 to decimal256 512 with same scale                  1.00     75.2±0.19ns        ? ?/sec           1.00     75.5±0.07ns        ? ?/sec
cast decimal32 to decimal32 512                                    1.00      2.3±0.01µs        ? ?/sec           1.00      2.3±0.01µs        ? ?/sec
cast decimal32 to decimal32 512 lower precision                    1.00      3.1±0.01µs        ? ?/sec           1.00      3.1±0.01µs        ? ?/sec
cast decimal32 to decimal64 512                                    1.00    319.7±0.94ns        ? ?/sec           1.01    322.9±0.60ns        ? ?/sec
cast decimal64 to decimal32 512                                    1.00      3.5±0.01µs        ? ?/sec           1.00      3.5±0.01µs        ? ?/sec
cast decimal64 to decimal64 512                                    1.00    387.4±0.96ns        ? ?/sec           1.00    387.8±0.39ns        ? ?/sec
cast dict to string view                                           1.04     53.9±0.09µs        ? ?/sec           1.00     52.0±0.28µs        ? ?/sec
cast f32 to string 512                                             1.03     19.3±0.17µs        ? ?/sec           1.00     18.8±0.06µs        ? ?/sec
cast f64 to string 512                                             1.00     21.6±0.21µs        ? ?/sec           1.00     21.6±0.08µs        ? ?/sec
cast float32 to int32 512                                          1.00  1568.0±15.03ns        ? ?/sec           1.00   1561.2±2.82ns        ? ?/sec
cast float64 to float32 512                                        1.03   1103.9±1.66ns        ? ?/sec           1.00   1074.8±3.44ns        ? ?/sec
cast float64 to uint64 512                                         1.00   1762.3±3.69ns        ? ?/sec           1.00  1770.5±10.49ns        ? ?/sec
cast i64 to string 512                                             1.00     14.2±0.10µs        ? ?/sec           1.00     14.2±0.04µs        ? ?/sec
cast int32 to float32 512                                          1.03   1069.1±9.13ns        ? ?/sec           1.00   1038.8±1.49ns        ? ?/sec
cast int32 to float64 512                                          1.02   1059.1±1.71ns        ? ?/sec           1.00   1037.0±2.35ns        ? ?/sec
cast int32 to int32 512                                            1.00    198.4±0.34ns        ? ?/sec           1.08    214.7±0.97ns        ? ?/sec
cast int32 to int64 512                                            1.02   1049.3±2.07ns        ? ?/sec           1.00   1029.1±2.74ns        ? ?/sec
cast int32 to uint32 512                                           1.02   1491.2±9.72ns        ? ?/sec           1.00  1462.2±10.65ns        ? ?/sec
cast int64 to int32 512                                            1.00   1692.3±3.21ns        ? ?/sec           1.01  1701.8±13.59ns        ? ?/sec
cast string to binary view 512                                     1.15      3.8±0.01µs        ? ?/sec           1.00      3.3±0.01µs        ? ?/sec
cast string view to binary view                                    1.00     96.9±0.21ns        ? ?/sec           1.06    103.1±0.12ns        ? ?/sec
cast string view to dict                                           1.00    169.3±0.32µs        ? ?/sec           1.13    190.5±1.33µs        ? ?/sec
cast string view to string                                         1.04     56.6±0.52µs        ? ?/sec           1.00     54.5±0.14µs        ? ?/sec
cast string view to wide string                                    1.04     50.9±0.74µs        ? ?/sec           1.00     48.8±0.11µs        ? ?/sec
cast time32s to time32ms 512                                       1.13    286.1±2.03ns        ? ?/sec           1.00    252.8±3.90ns        ? ?/sec
cast time32s to time64us 512                                       1.00    295.0±2.26ns        ? ?/sec           1.03    304.2±4.13ns        ? ?/sec
cast time64ns to time32s 512                                       1.01    511.7±4.05ns        ? ?/sec           1.00    507.7±0.84ns        ? ?/sec
cast timestamp_ms to i64 512                                       1.04    453.3±2.25ns        ? ?/sec           1.00    433.9±0.69ns        ? ?/sec
cast timestamp_ms to timestamp_ns 512                              1.00      2.2±0.01µs        ? ?/sec           1.00      2.2±0.00µs        ? ?/sec
cast timestamp_ns to timestamp_s 512                               1.00    201.1±0.60ns        ? ?/sec           1.05    210.6±0.27ns        ? ?/sec
cast utf8 to date32 512                                            1.00     11.3±0.02µs        ? ?/sec           1.02     11.5±0.03µs        ? ?/sec
cast utf8 to date64 512                                            1.06     47.6±0.09µs        ? ?/sec           1.00     44.8±0.07µs        ? ?/sec
cast utf8 to f32                                                   1.00     11.1±0.02µs        ? ?/sec           1.04     11.6±0.02µs        ? ?/sec
cast wide string to binary view 512                                1.00      5.6±0.01µs        ? ?/sec           1.02      5.8±0.01µs        ? ?/sec

alamb commented Oct 20, 2025

🤖: Benchmark completed

There appear to be no benchmarks for casting FSB --> ListArray (which this PR should speed up)

@rluvaton
Member Author

I just saw code here that used `repeat_n` and modified it to use this.

I did not benchmark that case.

@alamb alamb merged commit c74cbf2 into apache:main Oct 20, 2025
31 checks passed
@rluvaton rluvaton deleted the add-from-length-repeated-for-offset-buffer branch October 20, 2025 20:52
kylebarron pushed a commit that referenced this pull request Oct 21, 2025
…<repeat>));` with `OffsetBuffer::from_repeated_length(<val>, <repeat>);` (#8669)

# Which issue does this PR close?

N/A

# Rationale for this change

Use the dedicated, faster function for creating offsets that all have
the same length

# What changes are included in this PR?

replace
```rust
OffsetBuffer::from_lengths(std::iter::repeat_n(<val>, <repeat>));
```

with
```rust
OffsetBuffer::from_repeated_length(<val>, <repeat>);
```

# Are these changes tested?

Existing tests

# Are there any user-facing changes?

Nope

----

Related to:
- #8656
alamb pushed a commit that referenced this pull request Oct 21, 2025
# Which issue does this PR close?

N/A

# Rationale for this change

I want to repeat the same value multiple times in a very fast way
which will be used in:
- #8653

Once this and the PR below are merged, DataFusion's scalar-to-array
conversion can be updated to use this, making it really, really fast:
- #8656 

# What changes are included in this PR?

Added a function to `MutableBuffer` that repeats a slice a given number
of times using a logarithmic number of copies, to reduce `memcpy` calls
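
The idea of repeating a slice with a logarithmic number of copies can be sketched on a plain `Vec<u8>`. This is an illustration under assumptions, not the `MutableBuffer` implementation: the function name `repeat_slice_doubling` is hypothetical, and the doubling strategy is one plausible reading of "logarithmic". After seeding one copy of the pattern, each pass copies the region written so far, so the number of bulk copies grows with `log2(n)` rather than `n`.

```rust
/// Appends `n` copies of `pattern` to `out` using O(log n) bulk copies.
fn repeat_slice_doubling(out: &mut Vec<u8>, pattern: &[u8], n: usize) {
    if pattern.is_empty() || n == 0 {
        return;
    }
    let start = out.len();
    let total = pattern.len().checked_mul(n).expect("length overflow");
    out.reserve(total);

    // Seed with a single copy of the pattern.
    out.extend_from_slice(pattern);

    // Each pass re-copies what has been written so far, doubling it.
    // The final pass copies only the remainder needed to reach `total`.
    let mut written = pattern.len();
    while written < total {
        let chunk = written.min(total - written);
        out.extend_from_within(start..start + chunk);
        written += chunk;
    }
}

fn main() {
    let mut buf = vec![0xFF];
    repeat_slice_doubling(&mut buf, &[1, 2, 3], 5);
    assert_eq!(buf.len(), 1 + 3 * 5);
    println!("{buf:?}");
}
```

For very small `n` the setup cost can outweigh the savings, which is consistent with the `n=3` rows in the benchmark table below being slightly slower than the simple loop.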

# Are these changes tested?

Yes

# Are there any user-facing changes?

Yes, and added docs

-------

Extracted from:
- #8653

Benchmark results on local machine

| Slice Length | Repetitions (n) | repeat_slice_n_times | extend_from_slice loop | Speedup |
|--------------|-----------------|----------------------|------------------------|---------|
| 3 | 3 | 47.092 ns | 41.910 ns | 0.89x |
| 3 | 64 | 63.548 ns | 222.29 ns | 3.50x |
| 3 | 1024 | 105.57 ns | 3.031 µs | 28.7x |
| 3 | 8192 | 405.71 ns | 24.170 µs | 59.6x |
| 20 | 3 | 48.437 ns | 46.437 ns | 0.96x |
| 20 | 64 | 74.993 ns | 319.04 ns | 4.25x |
| 20 | 1024 | 350.94 ns | 4.437 µs | 12.6x |
| 20 | 8192 | 2.440 µs | 35.524 µs | 14.6x |
| 100 | 3 | 50.369 ns | 47.568 ns | 0.94x |
| 100 | 64 | 119.70 ns | 165.37 ns | 1.38x |
| 100 | 1024 | 1.734 µs | 2.623 µs | 1.51x |
| 100 | 8192 | 10.615 µs | 19.750 µs | 1.86x |

Full benchmark output:

<details>
<summary>Result</summary>


```
MutableBuffer repeat slice/repeat_slice_n_times/slice_len=3 n=3
                        time:   [46.719 ns 47.092 ns 47.453 ns]
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
MutableBuffer repeat slice/extend_from_slice loop/slice_len=3 n=3
                        time:   [41.833 ns 41.910 ns 41.996 ns]
Found 11 outliers among 100 measurements (11.00%)
  9 (9.00%) high mild
  2 (2.00%) high severe
MutableBuffer repeat slice/repeat_slice_n_times/slice_len=3 n=64
                        time:   [62.935 ns 63.548 ns 64.183 ns]
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high mild
MutableBuffer repeat slice/extend_from_slice loop/slice_len=3 n=64
                        time:   [221.75 ns 222.29 ns 222.86 ns]
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe
MutableBuffer repeat slice/repeat_slice_n_times/slice_len=3 n=1024
                        time:   [105.15 ns 105.57 ns 106.01 ns]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe
MutableBuffer repeat slice/extend_from_slice loop/slice_len=3 n=1024
                        time:   [3.0240 µs 3.0308 µs 3.0395 µs]
Found 11 outliers among 100 measurements (11.00%)
  2 (2.00%) low mild
  5 (5.00%) high mild
  4 (4.00%) high severe
MutableBuffer repeat slice/repeat_slice_n_times/slice_len=3 n=8192
                        time:   [401.57 ns 405.71 ns 409.94 ns]
Found 6 outliers among 100 measurements (6.00%)
  6 (6.00%) high mild
MutableBuffer repeat slice/extend_from_slice loop/slice_len=3 n=8192
                        time:   [24.124 µs 24.170 µs 24.222 µs]
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe
MutableBuffer repeat slice/repeat_slice_n_times/slice_len=20 n=3
                        time:   [48.287 ns 48.437 ns 48.606 ns]
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe
MutableBuffer repeat slice/extend_from_slice loop/slice_len=20 n=3
                        time:   [46.289 ns 46.437 ns 46.611 ns]
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe
MutableBuffer repeat slice/repeat_slice_n_times/slice_len=20 n=64
                        time:   [74.625 ns 74.993 ns 75.395 ns]
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
MutableBuffer repeat slice/extend_from_slice loop/slice_len=20 n=64
                        time:   [318.20 ns 319.04 ns 319.98 ns]
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe
MutableBuffer repeat slice/repeat_slice_n_times/slice_len=20 n=1024
                        time:   [346.66 ns 350.94 ns 355.17 ns]
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) low mild
  2 (2.00%) high severe
MutableBuffer repeat slice/extend_from_slice loop/slice_len=20 n=1024
                        time:   [4.4251 µs 4.4369 µs 4.4506 µs]
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  5 (5.00%) high severe
MutableBuffer repeat slice/repeat_slice_n_times/slice_len=20 n=8192
                        time:   [2.4336 µs 2.4401 µs 2.4465 µs]
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
MutableBuffer repeat slice/extend_from_slice loop/slice_len=20 n=8192
                        time:   [35.466 µs 35.524 µs 35.589 µs]
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  1 (1.00%) high severe
MutableBuffer repeat slice/repeat_slice_n_times/slice_len=100 n=3
                        time:   [50.209 ns 50.369 ns 50.530 ns]
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high mild
MutableBuffer repeat slice/extend_from_slice loop/slice_len=100 n=3
                        time:   [47.439 ns 47.568 ns 47.701 ns]
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
MutableBuffer repeat slice/repeat_slice_n_times/slice_len=100 n=64
                        time:   [117.77 ns 119.70 ns 122.00 ns]
Found 12 outliers among 100 measurements (12.00%)
  7 (7.00%) high mild
  5 (5.00%) high severe
MutableBuffer repeat slice/extend_from_slice loop/slice_len=100 n=64
                        time:   [164.88 ns 165.37 ns 166.07 ns]
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe
MutableBuffer repeat slice/repeat_slice_n_times/slice_len=100 n=1024
                        time:   [1.7278 µs 1.7335 µs 1.7398 µs]
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild
  1 (1.00%) high severe
MutableBuffer repeat slice/extend_from_slice loop/slice_len=100 n=1024
                        time:   [2.6176 µs 2.6232 µs 2.6305 µs]
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) high mild
  4 (4.00%) high severe
MutableBuffer repeat slice/repeat_slice_n_times/slice_len=100 n=8192
                        time:   [10.583 µs 10.615 µs 10.649 µs]
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
MutableBuffer repeat slice/extend_from_slice loop/slice_len=100 n=8192
                        time:   [19.471 µs 19.750 µs 20.185 µs]
Found 9 outliers among 100 measurements (9.00%)
  2 (2.00%) high mild
  7 (7.00%) high severe
```

</details>
alamb added a commit that referenced this pull request Oct 28, 2025
Waiting for the PRs below to be merged first:
- [x] #8654 - zip benchmarks

**This PR includes the following other PRs (unless merged)** to make the
review easier, so please make sure to review them first
- [x] #8658 - extracted from this
- [x] #8656 - extracted from this


# Which issue does this PR close?

N/A

# Rationale for this change

Making zip really fast for scalars

This is useful for `IF <expr> THEN <literal> ELSE <literal> END`
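
As a rough illustration of the scalar case, and not the `ScalarZipper` implementation, the sketch below selects between two scalar values according to a boolean mask. The function name `zip_scalars` and the use of plain `bool` slices without nulls are simplifying assumptions. Because both inputs are single values, the output can be produced without first materializing two full arrays.

```rust
/// For each mask entry, emit `truthy` when it is true and `falsy` otherwise.
fn zip_scalars<T: Copy>(mask: &[bool], truthy: T, falsy: T) -> Vec<T> {
    mask.iter()
        .map(|&selected| if selected { truthy } else { falsy })
        .collect()
}

fn main() {
    // Models `IF <expr> THEN 10 ELSE -1 END` evaluated over four rows.
    let mask = [true, false, false, true];
    let out = zip_scalars(&mask, 10i32, -1i32);
    assert_eq!(out, vec![10, -1, -1, 10]);
    println!("{out:?}");
}
```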

# What changes are included in this PR?

Created a couple of implementations for zipping scalars: one for
primitive types, one for bytes, and a fallback

# Are these changes tested?

existing tests

# Are there any user-facing changes?

new struct `ScalarZipper`

TODO:
- [x] Need to add comments if missing
- [x] Add tests for decimal and timestamp to make sure the type is kept

---------

Co-authored-by: Andrew Lamb <[email protected]>