Cast support for RunEndEncoded arrays #8589
87e543d to f9ae6f9
|
Raised this PR to get Richard Baah's excellent work over the line! cc @albertlockett @brancz @alamb @Rich-T-kid |
I think we're getting close to the finish line with these changes!
Implement casting between REE arrays and other Arrow types. REE-to-REE casting validates run-end upcasts only (Int16→Int32, Int16→Int64, Int32→Int64) to prevent invalid sequences.
Implement casting between REE arrays and other Arrow types. REE-to-REE casting validates run-end upcasts only (Int16→Int32, Int16→Int64, Int32→Int64) to prevent invalid sequences. rebased changes
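To illustrate why only run-end upcasts are accepted, here is a minimal std-only sketch. It is not the code from this PR; the function names `widen_run_ends` and `try_narrow_run_ends` are hypothetical and plain slices stand in for Arrow buffers. Widening can never lose information, while narrowing can produce run ends that do not fit the target type.

```rust
/// Hypothetical helper: widening i16 run ends to i64 always succeeds,
/// because every i16 value is representable as an i64.
fn widen_run_ends(run_ends: &[i16]) -> Vec<i64> {
    run_ends.iter().map(|&end| i64::from(end)).collect()
}

/// Hypothetical helper: narrowing i64 run ends to i16 can fail.
/// Returns None as soon as one run end does not fit in an i16.
fn try_narrow_run_ends(run_ends: &[i64]) -> Option<Vec<i16>> {
    run_ends.iter().map(|&end| i16::try_from(end).ok()).collect()
}
```

For example, run ends `[2, 40_000]` cannot be narrowed to Int16 because 40,000 exceeds `i16::MAX` (32,767), which is the kind of invalid sequence the upcast-only rule rules out.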
88c0d8a to 2358010
|
Is there some way we can avoid the quadratic codegen with code paths parameterized on both run end type and value type? Perhaps it'd be possible to identify where the transitions are, perhaps using the comparison kernels and comparing the array with a slice offset by one, and then use this to construct the indexes and a filter to construct the values array? Have we done any empirical quantification of the impact this has on code bloat / compile times? Edit: https://docs.rs/arrow-ord/latest/arrow_ord/partition/fn.partition.html is the function I'm thinking of. |
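The idea in this comment can be sketched without Arrow at all. The following is a rough std-only illustration, assuming plain slices in place of Arrow arrays; `partition_ranges` and `run_ends_and_indices` are hypothetical names and do not reproduce the `arrow_ord` API. The point is that boundary detection depends only on the value type, and building run ends depends only on the index type, so the two type parameters never meet in one function.

```rust
use std::ops::Range;

/// Hypothetical helper: splits `values` into maximal ranges of consecutive
/// equal elements, similar in spirit to partitioning a single column.
fn partition_ranges<T: PartialEq>(values: &[T]) -> Vec<Range<usize>> {
    let mut ranges = Vec::new();
    let mut start = 0;
    for i in 1..=values.len() {
        // A range closes at the end of the input or where the value changes.
        if i == values.len() || values[i] != values[i - 1] {
            ranges.push(start..i);
            start = i;
        }
    }
    ranges
}

/// Hypothetical helper: derives the run ends and the index of the first
/// element of each run. The indices could then drive a type-erased
/// take/filter to build the values array.
fn run_ends_and_indices(ranges: &[Range<usize>]) -> (Vec<i32>, Vec<usize>) {
    let run_ends = ranges
        .iter()
        .map(|r| i32::try_from(r.end).expect("run end must fit in i32"))
        .collect();
    let indices = ranges.iter().map(|r| r.start).collect();
    (run_ends, indices)
}
```

For the input `["a", "a", "b", "c", "c", "c"]` this yields the ranges `0..2`, `2..3`, `3..6`, run ends `[2, 3, 6]`, and first-element indices `[0, 2, 3]`.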
I have not! Happy to do that though. Any pointers to how you'd like me to do that, from previous PRs for example? Or does a basic comparison of compile time and binary size on main and this branch suffice? |
Just this, quadratic codegen is typically severe enough to be easily measurable. |
|
The compile time increased by 2 seconds. The size of |
|
Yeah... That's quite bad for a single kernel, especially given the relatively niche usage of RunEndEncodedArrays, I hope you can understand that we need to be careful to keep this under control. What did you think of my suggestion about using the partition kernel to compute the run ends? It might actually be faster and would largely eliminate the additional codegen. It would mean making arrow-cast depend on arrow-ord, which is a bit meh, but perhaps unavoidable. It could possibly be a feature flag. 🤔 |
|
I understand! I haven't had time to look into your suggestion, but I will. Out of curiosity, though, the approach in this PR seems quite similar to the code for dictionaries. Does the dictionary code similarly bloat the binary, and if so, why is that acceptable but not for REE? |
Dictionaries run into similar challenges, and a lot of effort has been expended trying to mitigate the bloat they cause. For example #3616 #4705 #4701 to name a few. Ultimately it's a compromise: there isn't a way to avoid this bloat and support dictionaries, so we pay the tax. With run-end encoded arrays the tax isn't necessary, so it is better we don't pay it. |
Thanks for the context! |
|
We're talking about the pack_runs macro right? I realize it's nice as a macro, but it also seems fine to just write out by hand. |
|
@tustvold I've addressed your comments now, let me know what you think. Thanks for the helpful and quick review! |
|
Will you be able to have a look at it soon @tustvold? Sounded like you were almost ready to approve it 😄 (Sorry for nagging!) |
|
Thanks for the symbolic approve! 😄 |
Thank you @vegarsti @brancz and @tustvold
I went through this PR carefully, and while I have a few suggestions for comments and additional tests, I think they could be done as follow-on PRs too if you prefer.
What I suggest as a flow is:
- @vegarsti addresses any comments they would like
- File follow on tickets: benchmark for REE casting, and some ideas for performance optimizations.
- Merge this PR
Thanks again 🚀
arrow-array = { workspace = true }
arrow-buffer = { workspace = true }
arrow-data = { workspace = true }
arrow-ord = { workspace = true }
I think we are trying to keep the number of dependencies to a minimum.
I see this is used to call partition, which is clever but overly general. I think you can also partition a (single) column using eq and an offset to look for consecutive rows which are different.
Something like
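let arr = ...;
// Compare each element with its successor; both slices have length arr.len() - 1.
let arr_shift1 = arr.slice(1, arr.len() - 1);
let transitions = eq(&arr.slice(0, arr.len() - 1), &arr_shift1)?;
However, the eq kernel is also in arrow-ord, so I am not sure there is a way around it.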
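As a rough illustration of how such an equality mask turns into run ends, here is a std-only sketch that assumes plain slices instead of Arrow arrays. The names `adjacent_eq_mask` and `run_ends_from_mask` are hypothetical and are not part of arrow-ord or this PR.

```rust
/// Hypothetical helper: mask[i] is true when values[i] == values[i + 1],
/// which is the shifted-by-one comparison expressed with `windows(2)`.
fn adjacent_eq_mask<T: PartialEq>(values: &[T]) -> Vec<bool> {
    values.windows(2).map(|pair| pair[0] == pair[1]).collect()
}

/// Hypothetical helper: every `false` at position i marks a run that ends
/// at i + 1; the last run always ends at `len` (if the input is non-empty).
fn run_ends_from_mask(mask: &[bool], len: usize) -> Vec<usize> {
    let mut ends: Vec<usize> = mask
        .iter()
        .enumerate()
        .filter_map(|(i, &same)| if same { None } else { Some(i + 1) })
        .collect();
    if len > 0 {
        ends.push(len);
    }
    ends
}
```

For `[1, 1, 2, 3, 3]` the mask is `[true, false, false, true]` and the run ends are `[2, 3, 5]`.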
Oh, that's nice!
Does it suffice here to file an issue for potentially getting rid of this dependency?
The way I'm reading this is eq might be faster but won't get rid of the dependency, but getting rid of the dependency would be the best option. Is that correct?
|
looks like there are some minor CI issues to address too |
|
Looks like someone triggered CI - thank you! There are failures, will address tomorrow. Edit: Oops, posted this before seeing the review. |
|
Thanks so much @alamb, this is excellent! Will ping you again after addressing. |
|
@alamb I've addressed your very helpful review, thank you so much. I think this is ready to merge now - if you agree, that is. I've filed issues
|
|
|
I pushed a commit to solve the RAT ci test, and merged up from main https://github.com/apache/arrow-rs/actions/runs/18813730483/job/53685630690?pr=8589 |
|
Thank you @vegarsti |
Which issue does this PR close?
RunArray (Run Length Encoding (RLE) / Run End Encoding (REE) support) #3520, but there is no specific issue for casting.
Rationale for this change
This PR implements casting support for RunEndEncoded arrays in Apache Arrow.
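For readers unfamiliar with the layout, casting a run-end encoded array to a flat type amounts to expanding each value up to its run end. The sketch below shows that expansion on plain slices; `decode_run_ends` is a hypothetical name, and the code assumes run ends are positive and strictly increasing. It is not the implementation in this PR.

```rust
/// Hypothetical helper: expands (run_ends, values) into a flat vector.
/// Assumes `run_ends` is strictly increasing and all entries are positive;
/// value i covers the logical positions [run_ends[i - 1], run_ends[i]).
fn decode_run_ends<T: Clone>(run_ends: &[i32], values: &[T]) -> Vec<T> {
    let mut out = Vec::new();
    let mut start = 0usize;
    for (&end, value) in run_ends.iter().zip(values) {
        let end = end as usize;
        // Repeat the value once for every position in its run.
        out.extend(std::iter::repeat(value.clone()).take(end - start));
        start = end;
    }
    out
}
```

With run ends `[2, 3, 6]` and values `['a', 'b', 'c']` the flat result is `a, a, b, c, c, c`.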
What changes are included in this PR?
- run_end_encoded_cast in arrow-cast/src/cast/run_array.rs
- cast_to_run_end_encoded in arrow-cast/src/cast/run_array.rs
- arrow-cast/src/cast/mod.rs
Are these changes tested?
Yes!
Are there any user-facing changes?
No breaking changes, just new functionality