-
Notifications
You must be signed in to change notification settings - Fork 29k
[WIP][SPARK-55056][SQL][PYTHON] Fix toPandas() SIGSEGV on nested empty arrays #53822
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
[WIP][SPARK-55056][SQL][PYTHON] Fix toPandas() SIGSEGV on nested empty arrays #53822
Conversation
JIRA Issue Information=== Bug SPARK-55056 === This comment was automatically generated by GitHub Actions |
1f8c172 to
33ef752
Compare
| // SPARK-55056: Arrow format requires ListArray offset buffer to have N+1 entries. | ||
| // Even when N=0, the buffer must contain [0]. Initialize offset buffer at construction | ||
| // to ensure it exists even if no elements are written. | ||
| valueVector.getOffsetBuffer.setInt(0, 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might the offset buffer be empty? Should we check if it is allocated with required size?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should never be empty. according to arrow's columnar documentation and its example, the List offset is always N+1. As a list contains at least 0 elements, its offset is at least 1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we don't need to check the allocated size.
The offset buffer is guaranteed to be allocated at this point. In ArrowWriter.create(), we call vector.allocateNew() before createFieldWriter():
def create(root: VectorSchemaRoot): ArrowWriter = {
val children = root.getFieldVectors().asScala.map { vector =>
vector.allocateNew() // allocates all buffers including nested children
createFieldWriter(vector) }
...
}
Arrow's ListVector.allocateNew() recursively allocates buffers for all nested child vectors, so when the ArrayWriter constructor runs, the offset buffer already exists with sufficient capacity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, when allocating offset buffer for ListVector, it should already set zero to the index 0.
|
We fixed it in the upstream arrow-java apache/arrow-java#967 |
What changes were proposed in this pull request?
Initialize the Arrow ListArray offset buffer at
ArrayWriterconstruction and after reset to ensure it contains[0]even when no elements are written.Why are the changes needed?
Arrow format requires ListArray offset buffer to have N+1 entries. Even when N=0, the buffer must contain
[0].When an outer array is empty, nested
ArrayWriters are never invoked, so their count stays 0. Arrow Java'sgetBufferSizeFor(0)returns 0, causing the offset buffer to be omitted in IPC serialization — violating Arrow spec. This causes SIGSEGV when PyArrow tries to read the malformed Arrow data.The fix directly initializes the offset buffer with
valueVector.getOffsetBuffer.setInt(0, 0)at construction and after reset, ensuring the buffer exists regardless of whether any elements are written.Does this PR introduce any user-facing change?
Yes.
toPandas()on triple-nested empty arrays no longer crashes.How was this patch tested?
ArrowWriterSuitetest_arrow.pyWas this patch authored or co-authored using generative AI tooling?
No