
Change internal array representation to LargeListArray#462

Open
hombit wants to merge 4 commits into main from large-list

Conversation

@hombit
Collaborator

@hombit hombit commented Mar 3, 2026

Changes internal NestedExtensionArray to use pa.LargeListArray (int64 offset) instead of pa.ListArray (int32 offset). This is motivated by wanting to support dataframes with more than 2**31 nested elements, which may be the case when loading large datasets with nested-pandas or returning large results from LSDB with .compute(). (I faced it myself when operating with DP2 pilots.)

This PR introduces breaking changes: by default all outputs are now large lists, including ndf.nested.to_lists(), ndf.to_parquet(), pa.array(ndf.nested), etc. However, this PR provides a new large_list: bool = True argument which, when set to False, returns "normal" lists. I'd like to hear opinions on whether we should keep this behavior or set it to False by default, from the perspective of hats/hats-import/lsdb usage.
Update: changed to large_list: bool = False by default; the only remaining case where we produce a LargeList is pa.array(ndf.nested), but I think that is ok.

The alternative design would be better support for chunked arrays, because we quite aggressively re-chunk the data in some operations. That would be much harder to implement and test, and could also lead to "memory fragmentation" issues in some use cases (for example, the concatenation of tens of thousands of partitions that happens when running lsdb.Catalog.compute() over a large catalog).

Closes #95

@hombit hombit requested a review from dougbrn March 3, 2026 18:58
@github-actions

github-actions bot commented Mar 3, 2026

| Before [436dda2] | After [83abc70] | Ratio | Benchmark (Parameter) |
| --- | --- | --- | --- |
| 581±200ms | 288±200ms | ~0.50 | benchmarks.ReadFewColumnsHTTPS.time_run |
| 61.6±0.4ms | 67.5±5ms | 1.10 | benchmarks.CountNestedBy.time_run |
| 1.17G | 1.23G | 1.05 | benchmarks.ReadFewColumnsS3.peakmem_run |
| 259M | 265M | 1.02 | benchmarks.AssignSingleDfToNestedSeries.peakmem_run |
| 103M | 105M | 1.02 | benchmarks.NestedFrameAddNested.peakmem_run |
| 108M | 110M | 1.02 | benchmarks.NestedFrameQuery.peakmem_run |
| 107M | 109M | 1.02 | benchmarks.NestedFrameReduce.peakmem_run |
| 10.7±0.2ms | 10.8±0.2ms | 1.01 | benchmarks.NestedFrameAddNested.time_run |
| 9.64±0.06ms | 9.75±0.1ms | 1.01 | benchmarks.NestedFrameQuery.time_run |
| 1.08±0.02ms | 1.09±0.02ms | 1.01 | benchmarks.NestedFrameReduce.time_run |

Click here to view all benchmarks.

@codecov

codecov bot commented Mar 3, 2026

Codecov Report

❌ Patch coverage is 91.66667% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 95.61%. Comparing base (436dda2) to head (b7fc60f).

Files with missing lines Patch % Lines
src/nested_pandas/series/utils.py 86.76% 9 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #462      +/-   ##
==========================================
- Coverage   95.97%   95.61%   -0.37%     
==========================================
  Files          20       20              
  Lines        2286     2324      +38     
==========================================
+ Hits         2194     2222      +28     
- Misses         92      102      +10     

☔ View full report in Codecov by Sentry.

Collaborator

@dougbrn dougbrn left a comment


I think this looks like a reasonable implementation, I have a couple of thoughts/comments:

  • Expectedly, there's a performance hit to this change (~10% on two of our benchmarks), and it sounds like you have use cases where you've run into this, but it does hurt to take a hit like this for all cases because of incompatibility at the edges.
  • Have you considered some kind of pandas.options parallel here for us to swap the backend? Probably this is a can of worms, but didn't know if you had thought about it at all.
  • As to the default value for the large_list kwarg, I don't know; I could see arguments for both. I liked False initially for minimal disruption of potential downstream workflows, but I'm not sure whether invoking the downcast hits performance at all in these cases. Default True seems nice in that the only reason someone would try to move off of it would be fine-tuning performance (again, if that even provides a benefit) or downstream compatibility.

@hombit
Collaborator Author

hombit commented Mar 4, 2026

Thank you, @dougbrn!

  • Expectedly, there's a performance hit to this change (~10% on two of our benchmarks), and it sounds like you have use cases where you've run into this, but it does hurt to take a hit like this for all cases because of incompatibility at the edges.

Oh, I missed that; it is a very good point! Let me see if I can do anything to improve the performance. I actually believe this edge case is very important from the perspective of large-catalog analysis with LSDB. We can also think about alternative designs, see a comment below and in the PR description.

  • Have you considered some kind of pandas.options parallel here for us to swap the backend? Probably this is a can of worms, but didn't know if you had thought about it at all.

I don't like pandas.options; it is too implicit. It would also be very hard to test and debug, both on our side and on the user's side.

  • As to the default value for the large_list kwarg, I don't know; I could see arguments for both. I liked False initially for minimal disruption of potential downstream workflows, but I'm not sure whether invoking the downcast hits performance at all in these cases. Default True seems nice in that the only reason someone would try to move off of it would be fine-tuning performance (again, if that even provides a benefit) or downstream compatibility.

I think I'll be fine with large_list=False by default. The only downside is that a pipeline debugged on a small dataset would unexpectedly fail on a large dataset, where large_list=True would actually be required.

Meta-comment
One more alternative design is supporting both LargeList and List on the Dtype/ExtensionArray level. But it makes the user interface much trickier. Another reason I think LargeList by default is good is that Polars switched to it after trying with List for a while; I think we can trust their experience.

@hombit hombit marked this pull request as draft March 6, 2026 22:32
@hombit
Collaborator Author

hombit commented Mar 6, 2026

I'm converting this to draft and working on the "chunking" alternative.

@hombit hombit marked this pull request as ready for review April 7, 2026 21:19
@hombit hombit marked this pull request as draft April 7, 2026 21:20
@hombit
Collaborator Author

hombit commented Apr 7, 2026

After the discussions, we are going with this approach, but with large_list=False by default (and with better error messages in the cases where it would fail).

@github-actions

github-actions bot commented Apr 9, 2026

Pandas Nightly Test Results (Python 3.11)

470 tests  +3   453 ✅ +3   17s ⏱️ ±0s
  1 suites ±0     0 💤 ±0 
  1 files   ±0    17 ❌ ±0 

For more details on these failures, see this check.

Results for commit b7fc60f. ± Comparison against base commit 436dda2.

♻️ This comment has been updated with latest results.

@hombit hombit marked this pull request as ready for review April 9, 2026 17:25
@hombit hombit requested a review from dougbrn April 9, 2026 17:25
@hombit hombit enabled auto-merge (squash) April 9, 2026 18:46
Collaborator

@dougbrn dougbrn left a comment


Overall looks great, thanks for doing so much digging on this! Just one nit to up code coverage which is not a blocking request if you don't have much time

def zero_align_offsets(array: pa.LargeListArray | pa.StructArray) -> pa.LargeListArray | pa.StructArray:
Collaborator


I noticed that this doesn't have test coverage, would be nice to add to not degrade project coverage too much

Collaborator Author


Nice catch, will do!


Development

Successfully merging this pull request may close these issues.

Handle series with more than 2^31 "flat" elements

2 participants