Cast support for RunEndEncoded arrays #8589

vegarsti · 2025-10-11T05:36:43Z

Which issue does this PR close?

Contributes towards the RunEndEncoded (REE) epic, [Epic] Implement RunArray (Run Length Encoding (RLE) / Run End Encoding (REE) support) #3520, but there is no specific issue for casting.
Replaces PRs Implemented casting for RunEnd Encoding #7713 and [Draft] Implemented casting for RunEnd Encoding (pt2) #8384.

Rationale for this change

This PR implements casting support for RunEndEncoded arrays in Apache Arrow.
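For readers new to the layout, here is a minimal plain-Rust sketch of what a run-end encoded sequence represents. It does not use the arrow-rs API. The helper name `decode_run_ends` and the fixed `i32` run-end type are assumptions made for illustration only.

```rust
/// Expand a run-end encoded sequence into a flat vector.
///
/// `run_ends[i]` is the exclusive logical end index of run `i`, so the run
/// ends must be strictly increasing and the last one equals the logical
/// length. `values[i]` is the value repeated throughout run `i`.
fn decode_run_ends<T: Clone>(run_ends: &[i32], values: &[T]) -> Vec<T> {
    assert_eq!(run_ends.len(), values.len(), "one value per run");
    let mut out = Vec::new();
    let mut start = 0i32;
    for (&end, value) in run_ends.iter().zip(values) {
        assert!(end > start, "run ends must be strictly increasing");
        // Repeat the run's value once per logical row it covers.
        out.extend(std::iter::repeat(value.clone()).take((end - start) as usize));
        start = end;
    }
    out
}
```

Under these assumptions, run ends `[2, 5, 6]` with values `["a", "b", "c"]` stand for the six logical rows `a, a, b, b, b, c`.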

What changes are included in this PR?

run_end_encoded_cast in arrow-cast/src/cast/run_array.rs
cast_to_run_end_encoded in arrow-cast/src/cast/run_array.rs
Tests in arrow-cast/src/cast/mod.rs

Are these changes tested?

Yes!

Are there any user-facing changes?

No breaking changes, just new functionality

vegarsti · 2025-10-11T06:26:59Z

Raised this PR to get Richard Baah's excellent work over the line! cc @albertlockett @brancz @alamb @Rich-T-kid

brancz

I think we're getting close to the finish line with these changes!


tustvold · 2025-10-12T22:20:03Z

Is there some way we can avoid the quadratic codegen with code paths parameterized on both run end type and value type? Perhaps it'd be possible to identify where the transitions are, perhaps using the comparison kernels and comparing the array with a slice offset by one, and then use this to construct the indexes and a filter to construct the values array?

Have we done any empirical quantification into the impact this has on code bloat / compile times?

Edit: https://docs.rs/arrow-ord/latest/arrow_ord/partition/fn.partition.html is the function I'm thinking of.
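The idea in this comment can be sketched in plain Rust, without any Arrow types. This is an illustration under stated assumptions, not the code that was merged: `partition_ranges` is a hypothetical stand-in for what a partition kernel computes (maximal ranges of equal consecutive values), and `run_end_encode` with its fixed `i32` run-end type is likewise made up for the example.

```rust
use std::ops::Range;

/// Split `values` into maximal ranges of equal consecutive elements.
/// This is the only step that needs to compare values.
fn partition_ranges<T: PartialEq>(values: &[T]) -> Vec<Range<usize>> {
    let mut ranges = Vec::new();
    let mut start = 0;
    for i in 1..values.len() {
        // A transition is any position whose value differs from its predecessor.
        if values[i] != values[i - 1] {
            ranges.push(start..i);
            start = i;
        }
    }
    if !values.is_empty() {
        ranges.push(start..values.len());
    }
    ranges
}

/// Build run ends and run values from the ranges.
/// The run ends depend only on the indices, not on the value type.
fn run_end_encode<T: PartialEq + Clone>(values: &[T]) -> (Vec<i32>, Vec<T>) {
    let ranges = partition_ranges(values);
    let run_ends = ranges
        .iter()
        .map(|r| i32::try_from(r.end).expect("run end exceeds i32::MAX"))
        .collect();
    // Taking the first element of each range plays the role of the filter.
    let run_values = ranges.iter().map(|r| values[r.start].clone()).collect();
    (run_ends, run_values)
}
```

The point of the split is that the comparison step varies only with the value type and the run-end step varies only with the index type, so no single code path has to be instantiated for every pair of the two.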

vegarsti · 2025-10-13T06:31:01Z

> Have we done any empirical quantification into the impact this has on code bloat / compile times?

I have not! Happy to do that though. Any pointers to how you'd like me to do that, from previous PRs for example? Or does a basic comparison of compile time and binary size on main and this branch suffice?

tustvold · 2025-10-13T08:46:27Z

> Or does a basic comparison of compile time and binary size on main and this branch suffice?

Just this, quadratic codegen is typically severe enough to be easily measurable.

vegarsti · 2025-10-13T10:18:53Z

The wall-clock compile time differed by about 2 seconds (1:08.66 on main vs. 1:06.33 on the branch).

         cargo build --release
main     569.35s user 23.69s system 863% cpu 1:08.66 total
branch   567.33s user 23.96s system 891% cpu 1:06.33 total

The size of libarrow_cast.rlib increased by 279 KB (3.82%).

         libarrow_cast.rlib size (bytes)
main     7,316,832
branch   7,596,568
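The percentage follows directly from the two byte counts. A small sketch of the arithmetic (the function name `size_increase` is hypothetical):

```rust
/// Return the absolute growth in bytes and the relative growth in percent.
fn size_increase(main_bytes: u64, branch_bytes: u64) -> (u64, f64) {
    let delta = branch_bytes - main_bytes;
    let percent = delta as f64 / main_bytes as f64 * 100.0;
    (delta, percent)
}
```

With the sizes above, `size_increase(7_316_832, 7_596_568)` gives a delta of 279,736 bytes, which is roughly 3.82% of the size on main.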

tustvold · 2025-10-13T10:21:47Z

Yeah... That's quite bad for a single kernel, especially given the relatively niche usage of RunEndEncodedArrays. I hope you can understand that we need to be careful to keep this under control.

What did you think of my suggestion about using the partition kernel to compute the run ends? It might actually be faster and would largely eliminate the additional codegen.

It would mean making arrow-cast depend on arrow-ord, which is a bit meh, but perhaps unavoidable. It could possibly be a feature flag. 🤔

vegarsti · 2025-10-13T11:14:08Z

I understand! I haven't had time to look into your suggestion, but I will.

Out of curiosity, though, the approach in this PR seems quite similar to the code for dictionaries. Does the dictionary code similarly bloat the binary, and if so, why is that acceptable but not for REE?

tustvold · 2025-10-13T11:23:03Z

> Out of curiosity, though, the approach in this PR seems quite similar to the code for dictionaries. Does the dictionary code similarly bloat the binary, and if so, why is that acceptable but not for REE?

Dictionaries run into similar challenges, and a lot of effort has been expended trying to mitigate the bloat they cause. For example, #3616, #4705, and #4701, to name a few. Ultimately it's a compromise: there isn't a way to avoid this bloat and still support dictionaries, so we pay the tax. With run-end encoded arrays the tax isn't necessary, so it is better that we don't pay it.

vegarsti · 2025-10-13T11:31:16Z

> > Out of curiosity, though, the approach in this PR seems quite similar to the code for dictionaries. Does the dictionary code similarly bloat the binary, and if so, why is that acceptable but not for REE?
>
> Dictionaries run into similar challenges, and a lot of effort has been expended trying to mitigate the bloat they cause. For example, #3616, #4705, and #4701, to name a few. Ultimately it's a compromise: there isn't a way to avoid this bloat and still support dictionaries, so we pay the tax. With run-end encoded arrays the tax isn't necessary, so it is better that we don't pay it.

Thanks for the context!

brancz · 2025-10-13T19:41:00Z

We're talking about the pack_runs macro, right? I realize it's nice as a macro, but it also seems fine to just write it out by hand.

vegarsti · 2025-10-18T07:18:28Z

@tustvold I've addressed your comments now, let me know what you think. Thanks for the helpful and quick review!

vegarsti · 2025-10-21T10:35:46Z

Will you be able to have a look at it soon @tustvold? Sounded like you were almost ready to approve it 😄 (Sorry for nagging!)

brancz

Love this approach with ord, also reads far simpler than before. Good suggestion @tustvold and nice work @vegarsti!

(I can't actually approve, but symbolically I'm approving this)

vegarsti · 2025-10-22T08:49:57Z

Thanks for the symbolic approve! 😄

alamb

Thank you @vegarsti @brancz and @tustvold

I went through this PR carefully, and while I have a few suggestions for comments and additional tests, I think they could be done as follow-on PRs too, if you prefer.

What I suggest as a flow is:

1. @vegarsti addresses any comments they would like
2. File follow-on tickets: benchmark for REE casting, and some ideas for performance optimizations.
3. Merge this PR

Thanks again 🚀

alamb · 2025-10-23T16:44:30Z

arrow-cast/Cargo.toml

 arrow-array = { workspace = true }
 arrow-buffer = { workspace = true }
 arrow-data = { workspace = true }
+arrow-ord = { workspace = true }


I think we are trying to keep the number of dependencies to a minimum.

I see this is used to call partition, which is clever but overly general. I think you can also partition a (single) column using eq and an offset to look for consecutive rows which are different.

Something like

    let arr = ...;
    let head = arr.slice(0, arr.len() - 1);
    let tail = arr.slice(1, arr.len() - 1);
    let transitions = eq(&head, &tail)?; // false where a new run starts

However, the eq kernel is also in arrow-ord, so I am not sure there is a way around it.
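To make the shifted-slice comparison concrete, here is a plain-Rust sketch on ordinary slices rather than Arrow arrays. The helper names `eq_with_next` and `run_ends_from_mask` are hypothetical, and the sketch ignores nulls, which a real kernel would have to handle.

```rust
/// Compare each element with its successor.
/// `mask[i]` is true when `values[i] == values[i + 1]`, so the mask has
/// one element fewer than `values`.
fn eq_with_next<T: PartialEq>(values: &[T]) -> Vec<bool> {
    if values.is_empty() {
        return Vec::new();
    }
    let head = &values[..values.len() - 1]; // all but the last element
    let tail = &values[1..]; // the same data shifted by one
    head.iter().zip(tail).map(|(a, b)| a == b).collect()
}

/// Turn the mask into exclusive run ends: a run ends at `i + 1` wherever
/// `mask[i]` is false, and the last run ends at `len`.
fn run_ends_from_mask(mask: &[bool], len: usize) -> Vec<usize> {
    let mut ends = Vec::new();
    for (i, same) in mask.iter().enumerate() {
        if !*same {
            ends.push(i + 1);
        }
    }
    if len > 0 {
        ends.push(len);
    }
    ends
}
```

For the input `[7, 7, 8, 9, 9]` the mask is `[true, false, false, true]` and the run ends are `[2, 3, 5]`; the run values can then be read at index `end - 1` for each run end.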

vegarsti

Oh, that's nice!

Does it suffice here to file an issue for potentially getting rid of this dependency?

The way I'm reading this is eq might be faster but won't get rid of the dependency, but getting rid of the dependency would be the best option. Is that correct?

Filed #8708 for getting rid of arrow-ord and #8707 for performance improvement

arrow-cast/src/cast/run_array.rs

arrow-cast/src/cast/mod.rs

alamb · 2025-10-23T17:13:36Z

looks like there are some minor CI issues to address too

vegarsti · 2025-10-23T17:19:56Z

Looks like someone triggered CI - thank you! There are failures, will address tomorrow.

Edit: Oops, posted this before seeing the review.

vegarsti · 2025-10-23T19:50:12Z

Thanks so much @alamb, this is excellent! Will ping you again after addressing.

vegarsti · 2025-10-25T07:47:01Z

@alamb I've addressed your very helpful review, thank you so much. I think this is ready to merge now - if you agree, that is. I've filed issues:

Add benchmark for RunEndEncoded casting #8709 for benchmark
Improve performance of RunEndEncoded cast #8707 for performance improvement
Remove arrow-ord dependency in arrow-cast due to RunEndEncoded casting #8708 for removing arrow-ord dependency

vegarsti · 2025-10-26T05:41:57Z

> @alamb I've addressed your very helpful review, thank you so much. I think this is ready to merge now - if you agree, that is. I've filed issues:
>
> Add benchmark for RunEndEncoded casting #8709 for benchmark
> Improve performance of RunEndEncoded cast #8707 for performance improvement
> Remove arrow-ord dependency in arrow-cast due to RunEndEncoded casting #8708 for removing arrow-ord dependency

PR for #8709: #8710


alamb · 2025-10-26T10:43:05Z

I pushed a commit to fix the RAT CI test, and merged up from main.

https://github.com/apache/arrow-rs/actions/runs/18813730483/job/53685630690?pr=8589

alamb · 2025-10-26T11:45:02Z

Thank you @vegarsti

github-actions bot added the arrow label (Changes to the arrow crate) Oct 11, 2025

This was referenced Oct 11, 2025

[Draft] Implemented casting for RunEnd Encoding (pt2) #8384

Draft

Support writing RunEndEncoded as Parquet #8069

Open

vegarsti changed the title ~~Casting to/from RunEndEncoded arrays~~ Casting support for RunEndEncoded arrays Oct 11, 2025

vegarsti force-pushed the cast-run-end-encoded-arrays branch 2 times, most recently from 87e543d to f9ae6f9 Compare October 11, 2025 06:25

brancz reviewed Oct 11, 2025


rich-t-kid-datadog and others added 10 commits October 12, 2025 07:49

Implemented casting for RunEnd Encoding

ca051e2

Implemented casting for RunEnd Encoding

60c52b4

feat: Add Run-End Encoded array casting with overflow protection

0a6d865

Implement casting between REE arrays and other Arrow types. REE-to-REE casting validates run-end upcasts only (Int16→Int32, Int16→Int64, Int32→Int64) to prevent invalid sequences.

feat: Add Run-End Encoded array casting with overflow protection

8b434d4

Implement casting between REE arrays and other Arrow types. REE-to-REE casting validates run-end upcasts only (Int16→Int32, Int16→Int64, Int32→Int64) to prevent invalid sequences. rebased changes

Use type specific zero-copy comparisons in cast_to_run_end_encoded

77cda81

Move tests in mod run_end_encoded_tests into mod tests

b666a97

panic if REE in cast_to_run_end_encoded

6eafcea

Use unreachable macro

3c2e837

Simplify some assertions

d1e5120

Extract populate_run_ends_and_values, which casts then iterates to identify runs

2358010

vegarsti force-pushed the cast-run-end-encoded-arrays branch from 88c0d8a to 2358010 Compare October 12, 2025 05:49

vegarsti added 2 commits October 12, 2025 08:39

Add missing Float16 and Decimal types to can_cast_to_run_end_encoded

7ed2872

Use a macro for packing runs

692f6ea

vegarsti added 3 commits October 18, 2025 08:50

Simplify variables in partition loop

e086d4c

Partition on cast_array, not array

a16d555

Support casting from dictionary types and add test for that

82c384b

vegarsti mentioned this pull request Oct 22, 2025

[Epic] Implement RunArray (Run Length Encoding (RLE) / Run End Encoding (REE) support) #3520

Open

16 tasks

brancz approved these changes Oct 22, 2025


alamb approved these changes Oct 23, 2025


vegarsti added 4 commits October 25, 2025 08:17

Remove can_cast_to_run_end_encoded

694814c

Improve run_end_encoded_cast, cast_to_run_end_encoded

17f4f6f

Address comments on tests

42c4044

Appease clippy

2f2c5e6

This was referenced Oct 25, 2025

Improve performance of RunEndEncoded cast #8707

Open

Remove arrow-ord dependency in arrow-cast due to RunEndEncoded casting #8708

Open

Add benchmark for RunEndEncoded casting #8709

Closed

vegarsti changed the title ~~Casting support for RunEndEncoded arrays~~ Cast support for RunEndEncoded arrays Oct 25, 2025

vegarsti mentioned this pull request Oct 26, 2025

Add benchmark for casting to RunEndEncoded (REE) #8710

Merged

vegarsti and others added 3 commits October 26, 2025 06:45

Add the index in an out of range error message

2a1f80f

Add apache license to pass RAT

aab0084

Merge remote-tracking branch 'apache/main' into cast-run-end-encoded-arrays

5e43f67

alamb merged commit c149027 into apache:main Oct 26, 2025
27 checks passed

vegarsti mentioned this pull request Oct 27, 2025

perf: Reduce allocations in cast_to_run_end_encoded #8726

Open
