Change internal array representation to LargeListArray #462
Codecov Report: ❌ Patch coverage is

@@ Coverage Diff @@
##             main     #462      +/-   ##
==========================================
- Coverage   95.97%   95.61%   -0.37%
==========================================
  Files          20       20
  Lines        2286     2324      +38
==========================================
+ Hits         2194     2222      +28
- Misses         92      102      +10
dougbrn
left a comment
I think this looks like a reasonable implementation, I have a couple of thoughts/comments:
- Expectedly, there's a performance hit to this change (~10% on two of our benchmarks), and it sounds like you have use cases where you've run into this, but it does hurt to take a hit like this for all cases because of incompatibility at the edges.
- Have you considered some kind of pandas.options parallel here for us to swap the backend? Probably this is a can of worms, but didn't know if you had thought about it at all.
- As to the default value for the `large_list` kwarg, I don't know, I could see arguments for both. I liked `False` initially for minimal disruption of potential downstream workflows, but I'm not sure whether invoking the downcast hits performance at all in these cases? Default `True` seems nice in that the only reason someone would try to move off of it would be fine-tuning performance (again, if that even provides a benefit), or downstream compatibility.
Thank you, @dougbrn!
Oh, I missed it, it is a very good point! Let's see if I can do anything to improve the performance. I actually believe that this edge case is very important from the perspective of large-catalog analysis with LSDB. We can also think about alternative designs; see a comment below and in the PR description.
I don't like

I think I'll be fine with

Meta-comment
I'm converting this to draft and working on the "chunking" alternative.
After the discussions, we're going with this approach, but with
Pandas Nightly Test Results (Python 3.11): 470 tests +3, 453 ✅ +3, 17s ⏱️ ±0s. For more details on these failures, see this check. Results for commit b7fc60f. ± Comparison against base commit 436dda2.
dougbrn
left a comment
Overall looks great, thanks for doing so much digging on this! Just one nit to bump up code coverage, which is not a blocking request if you don't have much time.
def zero_align_offsets(array: pa.LargeListArray | pa.StructArray) -> pa.LargeListArray | pa.StructArray:
I noticed that this doesn't have test coverage; it would be nice to add a test so we don't degrade project coverage too much.
There was a problem hiding this comment.
Nice catch, will do!
Changes internal `NestedExtensionArray` to use `pa.LargeListArray` (int64 offsets) instead of `pa.ListArray` (int32 offsets). This is motivated by wanting to support dataframes with more than 2**31 nested elements, which may be the case when loading large datasets with nested-pandas or returning large results from LSDB with `.compute()`. (I faced it myself when operating with DP2 pilots.)

This PR introduces breaking changes: by default, all outputs are now large lists, including `ndf.nested.to_lists()`, `ndf.to_parquet()`, `pa.array(ndf.nested)`, etc. However, this PR provides a new `large_list: bool = True` argument which, when set to `False`, returns "normal" lists. I'd like to hear opinions on whether we should keep this behavior or set it to `False` by default, from the perspective of hats/hats-import/lsdb usage.

Changed to `large_list: bool = False` by default; the only case where we still have LargeList is `pa.array(ndf.nested)`, but I think it is ok.

The alternative design would be better support of chunked arrays, because we quite aggressively re-chunk the data in some operations. This would be much harder to implement and test, and could also lead to "memory fragmentation" issues in some use cases (for example, the concatenation of tens of thousands of partitions that happens when running `lsdb.Catalog.compute()` over a large catalog).

Closes #95