Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add tests that arrow IPC data is validated #7096

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Feb 7, 2025

Which issue does this PR close?

Rationale for this change

To test disabling validation in the IPC reader I first need to show there is validation actually occuring.

It is also probably good in general to have tests showing we validate data when read

What changes are included in this PR?

Add tests that show when invalid arrow data is written to IPC files, the StreamReader, FileReader and FileDecoder catch and verify those errors

Are there any user-facing changes?

This is entirely tests, no changes in functionality

@github-actions github-actions bot added the arrow Changes to the arrow crate label Feb 7, 2025
@@ -1744,27 +1745,73 @@ mod tests {
});
}

fn roundtrip_ipc(rb: &RecordBatch) -> RecordBatch {
/// Write the record batch to an in-memory buffer in IPC File format
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just refactored these helpers into smaller chunks


/// Return the first record batch read from the IPC File buffer
/// using the FileDecoder API
fn read_ipc_with_decoder(buf: Vec<u8>) -> Result<RecordBatch, ArrowError> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Every time I find myself using FileDecoder I end up copy/pasting the example.

I am starting to think having something like

/// Incrementally decodes [`RecordBatch`]es from an IPC file stored in a Arrow
/// [`Buffer`] using the [`FileDecoder`] API.
///
/// This is a wrapper around the example in the `FileDecoder` which handles the
/// low level interaction with the Arrow IPC format.
struct IPCBufferDecoder {
/// Memory (or memory mapped) Buffer with the data
buffer: Buffer,
/// Decoder that reads Arrays that refers to the underlying buffers
decoder: FileDecoder,
/// Location of the batches within the buffer
batches: Vec<Block>,
}
in the actual crate might be useful

(given I am about to go copy/paste it again into comet...)

@@ -2492,4 +2539,109 @@ mod tests {
assert_eq!(decoded_batch.expect("Failed to read RecordBatch"), batch);
});
}

#[test]
fn test_validation_of_invalid_list_array() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any thoughts on how I can make an invalid PrimitiveArray would be most apprecaited.

Anything i tried to mismatch the len and the actual data, resulted in panics in bounds checks (even with the try_new_unchecked).

I think it is probably a good thing that it is so hard to create invalid arrays but it would be nice to test this path

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ArrayData will normally let you do various inadvisable things

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing I tried to do was make null buffers that had different sizes than the underlying values -- that actually failed the ArrayData creation checks, but when it was re-read via IPC the resulting arrays were fine.

Maybe it was due to padding or something. I'll play around with it some more

let err = read_ipc(&buf).unwrap_err();
assert_eq!(err.to_string(), expected_err);

// TODO verify there is no error when validation is disabled
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I plan to add a few more lines here when we disable validation

@alamb alamb marked this pull request as ready for review February 7, 2025 17:53
@alamb alamb changed the title Tests for IPC validation Add tests that arrow IPC data is validated Feb 7, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants