Skip to content

Conversation

@Yicong-Huang
Copy link
Contributor

@Yicong-Huang Yicong-Huang commented Jan 15, 2026

What changes were proposed in this pull request?

Initialize the Arrow ListArray offset buffer at ArrayWriter construction and after reset to ensure it contains [0] even when no elements are written.

Why are the changes needed?

Arrow format requires ListArray offset buffer to have N+1 entries. Even when N=0, the buffer must contain [0].

When an outer array is empty, nested ArrayWriters are never invoked, so their count stays 0. Arrow Java's getBufferSizeFor(0) returns 0, causing the offset buffer to be omitted in IPC serialization — violating Arrow spec. This causes SIGSEGV when PyArrow tries to read the malformed Arrow data.

The fix directly initializes the offset buffer with valueVector.getOffsetBuffer.setInt(0, 0) at construction and after reset, ensuring the buffer exists regardless of whether any elements are written.

Does this PR introduce any user-facing change?

Yes. toPandas() on triple-nested empty arrays no longer crashes.

How was this patch tested?

  • 2 Scala unit tests in ArrowWriterSuite
  • 4 Python integration tests in test_arrow.py

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions
Copy link

JIRA Issue Information

=== Bug SPARK-55056 ===
Summary: toPandas() crashes with SIGSEGV on nested empty arrays
Assignee: None
Status: Open
Affected: ["4.2.0"]


This comment was automatically generated by GitHub Actions

@Yicong-Huang Yicong-Huang force-pushed the SPARK-55056/fix/nested-empty-array-sigsegv branch from 1f8c172 to 33ef752 Compare January 16, 2026 01:02
// SPARK-55056: Arrow format requires ListArray offset buffer to have N+1 entries.
// Even when N=0, the buffer must contain [0]. Initialize offset buffer at construction
// to ensure it exists even if no elements are written.
valueVector.getOffsetBuffer.setInt(0, 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might the offset buffer be empty? Should we check if it is allocated with required size?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should never be empty. according to arrow's columnar documentation and its example, the List offset is always N+1. As a list contains at least 0 elements, its offset is at least 1.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we don't need to check the allocated size.

The offset buffer is guaranteed to be allocated at this point. In ArrowWriter.create(), we call vector.allocateNew() before createFieldWriter():

def create(root: VectorSchemaRoot): ArrowWriter = {  
    val children = root.getFieldVectors().asScala.map { vector =>   
       vector.allocateNew()  // allocates all buffers including nested children    
       createFieldWriter(vector)  }  
    ...
}

Arrow's ListVector.allocateNew() recursively allocates buffers for all nested child vectors, so when the ArrayWriter constructor runs, the offset buffer already exists with sufficient capacity.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, when allocating offset buffer for ListVector, it should already set zero to the index 0.

@Yicong-Huang Yicong-Huang marked this pull request as draft January 21, 2026 22:38
@Yicong-Huang Yicong-Huang changed the title [SPARK-55056][SQL][PYTHON] Fix toPandas() SIGSEGV on nested empty arrays [WIP][SPARK-55056][SQL][PYTHON] Fix toPandas() SIGSEGV on nested empty arrays Jan 21, 2026
@Yicong-Huang
Copy link
Contributor Author

We fixed it in the upstream arrow-java apache/arrow-java#967

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants