19924: Support JSON arrays reader/parse for datafusion #226
Conversation
Walkthrough

This pull request adds support for reading JSON data in both newline-delimited (NDJSON) and JSON array formats across the DataFusion ecosystem. The changes include: renaming NdJsonReadOptions to JsonReadOptions with a new newline_delimited boolean field; extending JsonFormat, JsonOpener, and JsonSource with format selection capabilities; implementing a JsonArrayToNdjsonReader utility for streaming JSON array-to-NDJSON conversion; updating protobuf definitions to persist the newline_delimited flag; adding test data and comprehensive tests for JSON array parsing; and updating all dependent code paths and examples to use the new API.
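To make the renamed API concrete, here is a minimal usage sketch. The `newline_delimited(false)` builder call, the file paths, and the assumption that `JsonReadOptions` is re-exported from the prelude like its predecessor are all based on the walkthrough above, not confirmed method names:

```rust
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();

    // Default: newline-delimited JSON, matching the previous NdJsonReadOptions behavior.
    let ndjson = ctx
        .read_json("data/events.ndjson", JsonReadOptions::default())
        .await?;
    ndjson.show().await?;

    // JSON array format: a single top-level `[ ... ]` of objects.
    // The builder-style setter name is assumed from the new `newline_delimited` field.
    let array_opts = JsonReadOptions::default().newline_delimited(false);
    let json_array = ctx.read_json("data/events.json", array_opts).await?;
    json_array.show().await?;

    Ok(())
}
```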
Summary of Changes

Hello @martin-augment, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces comprehensive support for reading JSON files in array format within DataFusion, complementing the existing newline-delimited JSON capabilities. The core change involves a new streaming mechanism that efficiently converts JSON arrays to a newline-delimited format on the fly, enabling processing of large files without excessive memory usage. This is exposed through a unified JsonReadOptions API.

Highlights
Comprehensive Review of PR #226: JSON Arrays Reader/Parse Support

Overview: This PR adds support for reading JSON array format.

1. Code Quality and Best Practices ⭐⭐⭐⭐

✅ Strengths
Code Review
This pull request adds support for reading JSON array format in DataFusion, in addition to the existing newline-delimited JSON (NDJSON) support. This is a significant feature enhancement. The implementation includes a streaming parser for JSON arrays, which is memory-efficient. The NdJsonReadOptions has been renamed to JsonReadOptions to reflect this change, with a deprecated type alias for backward compatibility. The changes are well-tested with a comprehensive new test suite. My review focuses on the correctness and maintainability of the new implementation. I've found one high-severity issue related to error propagation and one medium-severity issue regarding code duplication.
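The backward-compatibility alias mentioned here would presumably look something like the following sketch (the deprecation note text and lifetime parameter are assumptions based on the existing options struct):

```rust
// Sketch only: keeps existing code that names NdJsonReadOptions compiling
// while steering users toward the new name.
#[deprecated(note = "renamed to `JsonReadOptions`")]
pub type NdJsonReadOptions<'a> = JsonReadOptions<'a>;
```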
```rust
Err(e) => {
    log::error!("Error reading JSON stream: {e}");
    break;
}
```
When an error occurs while reading from the object store stream, it is logged, but not propagated to the consumer of the RecordBatch stream. The stream will simply end, or the parser might fail with a potentially misleading JSON parsing error. The underlying I/O error should be propagated to the output stream so the user is aware of the root cause.
You can fix this by cloning the result_tx sender before spawning the tasks and using it here to send the error. For example:
```rust
// Before spawning tasks
let (result_tx, result_rx) = tokio::sync::mpsc::channel(2);
let result_tx_clone = result_tx.clone();
```

Then use result_tx_clone in the read_task as suggested:

```rust
Err(e) => {
    log::error!("Error reading JSON stream: {e}");
    // Propagate the error to the output stream.
    let arrow_error = arrow::error::ArrowError::ExternalError(Box::new(e));
    if result_tx_clone.send(Err(arrow_error)).await.is_err() {
        // Receiver has been dropped, which is fine.
    }
    break;
}
```
value:useful; category:bug; feedback: The Gemini AI reviewer is correct! JSON array parsing errors are currently only logged but the caller is not informed about the error. This way the usage of the partially parsed data will lead to another error. Prevents hiding the error until someone tries to use the problematic data.
```diff
 let (schema, records_consumed) = match r.payload {
     #[cfg(not(target_arch = "wasm32"))]
     GetResultPayload::File(file, _) => {
         let decoder = file_compression_type.convert_read(file)?;
-        let mut reader = BufReader::new(decoder);
-        let iter = ValueIter::new(&mut reader, None);
-        infer_json_schema_from_iterator(iter.take_while(|_| take_while()))?
+        let reader = BufReader::new(decoder);
+
+        if newline_delimited {
+            // NDJSON: use ValueIter directly
+            let iter = ValueIter::new(reader, None);
+            let mut count = 0;
+            let schema =
+                infer_json_schema_from_iterator(iter.take_while(|_| {
+                    let should_take = count < records_to_read;
+                    if should_take {
+                        count += 1;
+                    }
+                    should_take
+                }))?;
+            (schema, count)
+        } else {
+            // JSON array format: use streaming converter
+            infer_schema_from_json_array(reader, records_to_read)?
+        }
     }
     GetResultPayload::Stream(_) => {
         let data = r.bytes().await?;
         let decoder = file_compression_type.convert_read(data.reader())?;
-        let mut reader = BufReader::new(decoder);
-        let iter = ValueIter::new(&mut reader, None);
-        infer_json_schema_from_iterator(iter.take_while(|_| take_while()))?
+        let reader = BufReader::new(decoder);
+
+        if newline_delimited {
+            let iter = ValueIter::new(reader, None);
+            let mut count = 0;
+            let schema =
+                infer_json_schema_from_iterator(iter.take_while(|_| {
+                    let should_take = count < records_to_read;
+                    if should_take {
+                        count += 1;
+                    }
+                    should_take
+                }))?;
+            (schema, count)
+        } else {
+            // JSON array format: use streaming converter
+            infer_schema_from_json_array(reader, records_to_read)?
+        }
     }
 };
```
The logic for inferring the schema for newline-delimited JSON is duplicated between the GetResultPayload::File and GetResultPayload::Stream match arms. This can be refactored into a helper function to improve maintainability and reduce code duplication. The helper function could take a generic Read trait object.
value:good-to-have; category:bug; feedback: The Gemini AI reviewer is correct! There is some duplicated code in the branches for processing files and streams from the ObjectStore that could be extracted into a helper function to prevent double maintenance and possible diversion
🤖 Augment PR Summary

Summary: This PR extends DataFusion’s JSON reader to support both newline-delimited JSON (NDJSON) and JSON array files.

Technical Notes: JSON array files are parsed by streaming conversion to NDJSON to avoid loading full arrays into memory, while NDJSON remains the default behavior.
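To make the streaming conversion concrete, a small illustrative sketch follows. The import path and buffer size are assumptions, while `with_capacity` and `validate_complete` appear in the excerpts quoted later in this thread:

```rust
use std::io::{Cursor, Read};

// The import path is an assumption; the type lives in
// datafusion/datasource-json/src/utils.rs in this PR.
// use datafusion_datasource_json::JsonArrayToNdjsonReader;

fn main() -> std::io::Result<()> {
    let json_array = r#"[{"a": 1, "b": "x"}, {"a": 2, "b": "y"}]"#;
    let mut reader =
        JsonArrayToNdjsonReader::with_capacity(Cursor::new(json_array), 8 * 1024);

    let mut ndjson = String::new();
    reader.read_to_string(&mut ndjson)?;
    // `ndjson` now holds one object per line, roughly:
    //   {"a": 1, "b": "x"}
    //   {"a": 2, "b": "y"}

    // Surface a missing ']' or trailing junk explicitly once the input is drained.
    reader.validate_complete()?;
    Ok(())
}
```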
```rust
if self.output_pos >= self.output_filled {
    self.fill_output_buffer()?;
    if self.output_filled == 0 {
        return Ok(0); // EOF
```
At EOF this returns Ok(0) even if the input JSON array was never closed or had trailing non-whitespace content, which can make malformed JSON silently parse as valid NDJSON. Since validate_complete() already tracks this, consider surfacing an InvalidData error on EOF when state != Done or has_trailing_content is true.
Severity: medium
Other Locations
- datafusion/datasource-json/src/source.rs:316
- datafusion/datasource-json/src/source.rs:394
- datafusion/datasource-json/src/file_format.rs:226
value:useful; category:bug; feedback: The Augment AI reviewer is correct! Every usage of JsonArrayToNdjsonReader to read JSON array file/stream should be followed by calling validate_complete() to make sure that the content is well formed JSON array and there are not any problems with it. Prevents reading/parsing JSON that is not following the JSON specification.
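A minimal sketch of the suggested EOF check, reusing the fields and helpers shown elsewhere in this thread (`fill_output_buffer`, `validate_complete`). How it slots into the real `read` implementation, including the copy of pending bytes at the end, is an assumption:

```rust
fn read(&mut self, buf: &mut [u8]) -> std::io::Result<usize> {
    if self.output_pos >= self.output_filled {
        self.fill_output_buffer()?;
        if self.output_filled == 0 {
            // Input is exhausted: reject unterminated arrays, unbalanced
            // brackets, or trailing junk instead of reporting a clean EOF.
            self.validate_complete()?;
            return Ok(0); // well-formed EOF
        }
    }
    let n = buf.len().min(self.output_filled - self.output_pos);
    buf[..n].copy_from_slice(&self.output_buffer[self.output_pos..self.output_pos + n]);
    self.output_pos += n;
    Ok(n)
}
```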
```rust
while let Some(chunk) = decompressed_stream.next().await {
    match chunk {
        Ok(bytes) => {
            if byte_tx.send(bytes).is_err() {
```
This std::sync::mpsc::sync_channel send call blocks the async task when the buffer is full, which can stall a Tokio worker thread under backpressure. Consider using an async channel (tokio::sync::mpsc) or moving the blocking send into a blocking context.
Severity: medium
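Both alternatives could look roughly like the following sketch; the channel element type (`Bytes`) and function shape are assumptions, not the PR's actual signatures:

```rust
use bytes::Bytes;

// Option 1: an async bounded channel; awaiting `send` yields to the runtime
// instead of blocking a Tokio worker thread when the buffer is full.
async fn forward_async(tx: &tokio::sync::mpsc::Sender<Bytes>, chunk: Bytes) -> bool {
    tx.send(chunk).await.is_ok()
}

// Option 2: keep the std sync_channel, but perform the potentially blocking
// send on the blocking thread pool.
async fn forward_blocking(tx: &std::sync::mpsc::SyncSender<Bytes>, chunk: Bytes) -> bool {
    let tx = tx.clone();
    tokio::task::spawn_blocking(move || tx.send(chunk).is_ok())
        .await
        .unwrap_or(false)
}
```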
```rust
        }
    }
    Err(e) => {
        log::error!("Error reading JSON stream: {e}");
```
On read/decompression errors this path only logs and then ends the stream, which can look like a clean EOF and hide the real failure cause. Consider propagating the error to the consumer (e.g., via the result_tx channel) so queries fail with the underlying error.
Severity: medium
value:useful; category:bug; feedback: The Augment AI reviewer is correct! JSON array parsing errors are currently only logged but the caller is not informed about the error. This way the usage of the partially parsed data will lead to another error. Prevents hiding the error until someone tries to use the problematic data.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
```rust
Err(e) => {
    log::error!("Error reading JSON stream: {e}");
    break;
}
```
Stream errors silently swallowed causing partial results
High Severity
When streaming JSON array format from remote storage (S3, etc.), errors from the object store stream are only logged and then the task breaks, but the error is never propagated to the consumer. When byte_tx is dropped, ChannelReader returns EOF (0 bytes), causing parse_task to complete normally with whatever partial data was processed. This results in queries silently returning incomplete results when network errors occur, rather than failing with an error. The read_task has no way to communicate errors to the consumer since result_tx is only accessible from parse_task.
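For illustration, a hypothetical `ChannelReader` along the lines described above shows why a dropped sender is indistinguishable from a clean EOF (the PR's real implementation likely differs):

```rust
use std::io::Read;
use std::sync::mpsc::Receiver;

use bytes::Bytes;

struct ChannelReader {
    rx: Receiver<Bytes>,
    pending: Bytes,
}

impl Read for ChannelReader {
    fn read(&mut self, buf: &mut [u8]) -> std::io::Result<usize> {
        if self.pending.is_empty() {
            match self.rx.recv() {
                Ok(bytes) => self.pending = bytes,
                // Sender dropped (including after an upstream error): this is
                // reported as a clean EOF, hiding the real failure from the consumer.
                Err(_) => return Ok(0),
            }
        }
        let n = buf.len().min(self.pending.len());
        let chunk = self.pending.split_to(n);
        buf[..n].copy_from_slice(&chunk);
        Ok(n)
    }
}
```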
value:useful; category:bug; feedback: The Bugbot AI reviewer is correct! JSON array parsing errors are currently only logged but the caller is not informed about the error. This way the usage of the partially parsed data will lead to another error. Prevents hiding the error until someone tries to use the problematic data.
```rust
            }
            should_take
        }))?;
        (schema, count)
```
Duplicated NDJSON schema inference logic
Medium Severity
The NDJSON schema inference logic is duplicated between the File branch (lines 292-304) and Stream branch (lines 315-326) in infer_schema(). Both contain identical code using ValueIter, infer_json_schema_from_iterator, and record counting. A helper function like infer_schema_from_ndjson(reader, max_records) would eliminate this duplication, consistent with the existing infer_schema_from_json_array helper.
value:good-to-have; category:bug; feedback: The Bugbot AI reviewer is correct! There is some duplicated code in the branches for processing files and streams from the ObjectStore that could be extracted into a helper function to prevent double maintenance and possible diversion
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@datafusion/datasource-json/src/source.rs`:
- Around line 370-388: The read task spawned by SpawnedTask::spawn currently
logs read errors and breaks, which hides failures from the consumer; change the
loop handling for decompressed_stream in the spawned task to send any Err(e)
into the existing error channel (error_tx) instead of only logging, then break
so the byte channel closes; ensure the surrounding code that builds the consumer
stream (e.g., ChannelReader / result stream) reads from error_rx and returns or
yields the error to callers so consumers see the read failure instead of a
silent EOF.
In `@datafusion/datasource-json/src/utils.rs`:
- Around line 109-133: The reader currently silently discards any bytes before
the opening '['; modify JsonArrayToNdjsonReader to detect and reject
non‑whitespace bytes that appear before the first '[' by (1) adding a boolean
flag (e.g., leading_non_whitespace) or reusing/augmenting an existing state to
record if any non‑whitespace was seen while in the pre‑array state, (2) set that
flag when scanning bytes before the initial '[' if a byte is not ASCII
whitespace, and (3) update validate_complete() to return an InvalidData Err when
the flag is set (i.e., trailing/preceding junk was observed) so inputs like
"junk[...]" are rejected; ensure the detection happens in the same code paths
that currently skip until '[' (refer to JsonArrayToNdjsonReader, its pre‑array
handling code and validate_complete()) and that tests covering leading junk are
added/updated.
- Around line 162-196: The JsonArrayToNdjsonReader instances are never validated
at the end of consumption, so malformed arrays slip through; call
JsonArrayToNdjsonReader::validate_complete() on each production path that
constructs and hands the reader off (i.e., the paths that pass the reader into
the Arrow ReaderBuilder and the path that wraps it with ValueIter) and propagate
any std::io::Error returned. Concretely: after the reader has been fully
consumed (or immediately before passing it to ReaderBuilder or constructing
ValueIter), invoke reader.validate_complete()? and return or map the error into
the surrounding Result so missing closing ']', unterminated strings, or trailing
content are reported. Ensure this is added to the buffered JSON array path, the
streaming JSON array path, and the schema inference path that uses ValueIter.
🧹 Nitpick comments (3)
datafusion/datasource-json/Cargo.toml (1)
46-50: Align log/serde_json to workspace dependencies. Using explicit versions here can drift from workspace policy and create version skew. Prefer `workspace = true` for consistency.

🔧 Suggested change

```diff
-log = "0.4.29"
+log = { workspace = true }
 ...
-serde_json = "1.0.149"
+serde_json = { workspace = true }
```

datafusion/core/src/datasource/file_format/json.rs (1)

59-64: Use PathBuf::join for cross-platform temp paths. String concatenation with "/" is less portable than Path::join.

Suggested change

```diff
-        let path = format!("{}/test.json", tmp_dir.path().to_string_lossy());
-        std::fs::write(&path, content).unwrap();
-        (tmp_dir, path)
+        let path = tmp_dir.path().join("test.json");
+        std::fs::write(&path, content).unwrap();
+        (tmp_dir, path.to_string_lossy().to_string())
```

datafusion/datasource-json/src/file_format.rs (1)

286-331: Consider extracting duplicated NDJSON schema inference logic. The NDJSON inference code in the File branch (lines 292-304) and Stream branch (lines 315-326) is nearly identical. Consider extracting it into a helper function similar to infer_schema_from_json_array:

```rust
fn infer_schema_from_ndjson<R: Read>(
    reader: R,
    max_records: usize,
) -> Result<(Schema, usize)> {
    let iter = ValueIter::new(reader, None);
    let mut count = 0;
    let schema = infer_json_schema_from_iterator(iter.take_while(|_| {
        let should_take = count < max_records;
        if should_take {
            count += 1;
        }
        should_take
    }))?;
    Ok((schema, count))
}
```

This would reduce code duplication and improve maintainability.
```rust
// Async task: read from object store stream and send bytes to channel
// Store the SpawnedTask to keep it alive until stream is dropped
let read_task = SpawnedTask::spawn(async move {
    tokio::pin!(decompressed_stream);
    while let Some(chunk) = decompressed_stream.next().await {
        match chunk {
            Ok(bytes) => {
                if byte_tx.send(bytes).is_err() {
                    break; // Consumer dropped
                }
            }
            Err(e) => {
                log::error!("Error reading JSON stream: {e}");
                break;
            }
        }
    }
    // byte_tx dropped here, signals EOF to ChannelReader
});
```
Error in read task is silently dropped.
When an error occurs reading from the object store stream (line 381-384), the error is only logged and the task breaks. The consumer will see the stream end without knowing an error occurred. Consider propagating the error:
🛡️ Proposed fix to propagate read errors
```diff
-        // Async task: read from object store stream and send bytes to channel
-        // Store the SpawnedTask to keep it alive until stream is dropped
-        let read_task = SpawnedTask::spawn(async move {
+        // Channel for read errors
+        let (error_tx, mut error_rx) = tokio::sync::oneshot::channel();
+
+        let read_task = SpawnedTask::spawn(async move {
             tokio::pin!(decompressed_stream);
             while let Some(chunk) = decompressed_stream.next().await {
                 match chunk {
                     Ok(bytes) => {
                         if byte_tx.send(bytes).is_err() {
                             break; // Consumer dropped
                         }
                     }
                     Err(e) => {
                         log::error!("Error reading JSON stream: {e}");
+                        let _ = error_tx.send(e);
                         break;
                     }
                 }
             }
         });
```

Then check error_rx in the result stream to surface the error to the consumer.
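The consumer-side check could then look roughly like this fragment inside the function that builds the output stream; the `result_rx` receiver and the use of `ArrowError::ExternalError` are assumptions drawn from the surrounding discussion:

```rust
use std::task::Poll;

use arrow::error::ArrowError;
use futures::stream;

// `error_rx` is the tokio::sync::oneshot::Receiver created in the fix above;
// `result_rx` is assumed to be a tokio::sync::mpsc::Receiver of
// Result<RecordBatch, ArrowError> items produced by the parse task.
let batches = stream::poll_fn(move |cx| {
    // Surface a read-task failure as soon as it has been reported.
    if let Ok(e) = error_rx.try_recv() {
        return Poll::Ready(Some(Err(ArrowError::ExternalError(Box::new(e)))));
    }
    result_rx.poll_recv(cx)
});
```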
🤖 Prompt for AI Agents
In `@datafusion/datasource-json/src/source.rs` around lines 370 - 388, The read
task spawned by SpawnedTask::spawn currently logs read errors and breaks, which
hides failures from the consumer; change the loop handling for
decompressed_stream in the spawned task to send any Err(e) into the existing
error channel (error_tx) instead of only logging, then break so the byte channel
closes; ensure the surrounding code that builds the consumer stream (e.g.,
ChannelReader / result stream) reads from error_rx and returns or yields the
error to callers so consumers see the read failure instead of a silent EOF.
value:useful; category:bug; feedback: The CodeRabbit AI reviewer is correct! JSON array parsing errors are currently only logged but the caller is not informed about the error. This way the usage of the partially parsed data will lead to another error. Prevents hiding the error until someone tries to use the problematic data.
```rust
pub struct JsonArrayToNdjsonReader<R: Read> {
    /// Inner reader - we use R directly (not `BufReader<R>`) for SyncIoBridge compatibility
    inner: R,
    state: JsonArrayState,
    /// Tracks nesting depth of `{` and `[` to identify top-level commas
    depth: i32,
    /// Whether we're currently inside a JSON string
    in_string: bool,
    /// Whether the next character is escaped (after `\`)
    escape_next: bool,
    /// Input buffer - stores raw bytes read from inner reader
    input_buffer: Vec<u8>,
    /// Current read position in input buffer
    input_pos: usize,
    /// Number of valid bytes in input buffer
    input_filled: usize,
    /// Output buffer - stores transformed NDJSON bytes
    output_buffer: Vec<u8>,
    /// Current read position in output buffer
    output_pos: usize,
    /// Number of valid bytes in output buffer
    output_filled: usize,
    /// Whether trailing non-whitespace content was detected after ']'
    has_trailing_content: bool,
}
```
Reject non‑whitespace before the opening '['.
Line 205 currently ignores any byte that isn’t [, even if it’s non‑whitespace. That means inputs like junk[{"a":1}] are silently accepted and the leading junk discarded, producing valid NDJSON from invalid JSON. Consider tracking leading non‑whitespace and surfacing InvalidData in validate_complete().
🔧 Suggested fix

```diff
 pub struct JsonArrayToNdjsonReader<R: Read> {
     /// Whether trailing non-whitespace content was detected after ']'
     has_trailing_content: bool,
+    /// Whether non-whitespace content was detected before '['
+    has_leading_content: bool,
 }
@@
     pub fn with_capacity(reader: R, capacity: usize) -> Self {
         Self {
@@
             has_trailing_content: false,
+            has_leading_content: false,
         }
     }
@@
     pub fn validate_complete(&self) -> std::io::Result<()> {
+        if self.has_leading_content {
+            return Err(std::io::Error::new(
+                std::io::ErrorKind::InvalidData,
+                "Malformed JSON array: unexpected content before '['",
+            ));
+        }
         if self.depth != 0 {
@@
     fn process_byte(&mut self, byte: u8) -> Option<u8> {
         match self.state {
             JsonArrayState::Start => {
                 // Looking for the opening '[', skip whitespace
                 if byte == b'[' {
                     self.state = JsonArrayState::InArray;
-                }
-                // Skip whitespace and the '[' itself
+                    return None;
+                }
+                // Track non-whitespace before '[' so we can reject invalid input
+                if !byte.is_ascii_whitespace() {
+                    self.has_leading_content = true;
+                }
                 None
             }
```

Also applies to: 145-159, 162-196, 199-210
🤖 Prompt for AI Agents
In `@datafusion/datasource-json/src/utils.rs` around lines 109 - 133, The reader
currently silently discards any bytes before the opening '['; modify
JsonArrayToNdjsonReader to detect and reject non‑whitespace bytes that appear
before the first '[' by (1) adding a boolean flag (e.g., leading_non_whitespace)
or reusing/augmenting an existing state to record if any non‑whitespace was seen
while in the pre‑array state, (2) set that flag when scanning bytes before the
initial '[' if a byte is not ASCII whitespace, and (3) update
validate_complete() to return an InvalidData Err when the flag is set (i.e.,
trailing/preceding junk was observed) so inputs like "junk[...]" are rejected;
ensure the detection happens in the same code paths that currently skip until
'[' (refer to JsonArrayToNdjsonReader, its pre‑array handling code and
validate_complete()) and that tests covering leading junk are added/updated.
value:useful; category:bug; feedback: The CodeRabbit AI reviewer is correct! The JSON array reader/parser checks whether there is a trailing content after the closing ']' character but it does not do the same for leading content before the opening '[' character. Prevents reading invalid JSON array data.
```rust
    /// Check if the JSON array was properly terminated.
    ///
    /// This should be called after all data has been read.
    ///
    /// Returns an error if:
    /// - Unbalanced braces/brackets (depth != 0)
    /// - Unterminated string
    /// - Missing closing `]`
    /// - Unexpected trailing content after `]`
    pub fn validate_complete(&self) -> std::io::Result<()> {
        if self.depth != 0 {
            return Err(std::io::Error::new(
                std::io::ErrorKind::InvalidData,
                "Malformed JSON array: unbalanced braces or brackets",
            ));
        }
        if self.in_string {
            return Err(std::io::Error::new(
                std::io::ErrorKind::InvalidData,
                "Malformed JSON array: unterminated string",
            ));
        }
        if self.state != JsonArrayState::Done {
            return Err(std::io::Error::new(
                std::io::ErrorKind::InvalidData,
                "Incomplete JSON array: expected closing bracket ']'",
            ));
        }
        if self.has_trailing_content {
            return Err(std::io::Error::new(
                std::io::ErrorKind::InvalidData,
                "Malformed JSON: unexpected trailing content after ']'",
            ));
        }
        Ok(())
```
🧩 Analysis chain
🏁 Script executed:
```sh
# Find construction and validation call sites for JsonArrayToNdjsonReader
rg -n "JsonArrayToNdjsonReader" -S
rg -n "validate_complete" -S
```

Repository: martin-augment/datafusion
Length of output: 4602

🏁 Script executed:

```sh
# Check source.rs around the reader construction sites
sed -n '310,370p' datafusion/datasource-json/src/source.rs
# Check file_format.rs around the reader construction site
sed -n '220,250p' datafusion/datasource-json/src/file_format.rs
```

Repository: martin-augment/datafusion
Length of output: 4108

🏁 Script executed:

```sh
# Look at the full implementation of JsonArrayToNdjsonReader
sed -n '109,200p' datafusion/datasource-json/src/utils.rs
# Also check the Read and BufRead trait implementations
sed -n '326,360p' datafusion/datasource-json/src/utils.rs
```

Repository: martin-augment/datafusion
Length of output: 4588

🏁 Script executed:

```sh
# Search for Drop implementation and any post-processing validation
sed -n '200,330p' datafusion/datasource-json/src/utils.rs
# Check how source.rs actually uses the arrow_reader after building
sed -n '310,330p' datafusion/datasource-json/src/source.rs
```

Repository: martin-augment/datafusion
Length of output: 5925

🏁 Script executed:

```sh
# Search for Drop implementation
rg "impl.*Drop.*JsonArrayToNdjsonReader" -A 20
# Also check if the issue is addressed in tests with invalid data
sed -n '566,595p' datafusion/datasource-json/src/utils.rs
```

Repository: martin-augment/datafusion
Length of output: 1362
Add validate_complete() call to all production code paths using JsonArrayToNdjsonReader.
Incomplete or malformed JSON arrays silently pass validation when validate_complete() is not called. The following production paths construct JsonArrayToNdjsonReader but never validate:
- source.rs:316-320: JSON array buffered read path passes reader to Arrow ReaderBuilder without validation
- source.rs:394: JSON array streaming path passes reader to Arrow ReaderBuilder without validation
- file_format.rs:226: Schema inference path passes reader to ValueIter without validation
Missing closing ], unterminated strings, and trailing junk are only detected by validate_complete(). Without explicit calls after the reader is exhausted, invalid input completes successfully.
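As an illustration for the buffered read path, the validation could slot in after Arrow's JSON reader has fully drained the converter; this is a simplified sketch under that assumption, not the PR's actual code (projection, compression, and other plumbing are omitted):

```rust
use std::io::{BufReader, Cursor};

use arrow::datatypes::SchemaRef;
use arrow::error::ArrowError;
use arrow::json::ReaderBuilder;
use arrow::record_batch::RecordBatch;

fn read_json_array_buffered(
    bytes: Vec<u8>,
    schema: SchemaRef,
    batch_size: usize,
) -> Result<Vec<RecordBatch>, ArrowError> {
    let mut converter =
        JsonArrayToNdjsonReader::with_capacity(Cursor::new(bytes), 8 * 1024);
    let batches = {
        let reader = ReaderBuilder::new(schema)
            .with_batch_size(batch_size)
            .build(BufReader::new(&mut converter))?;
        reader.collect::<Result<Vec<_>, _>>()?
    };
    // The converter is fully drained at this point; reject a missing ']',
    // an unterminated string, or trailing junk before returning the batches.
    converter
        .validate_complete()
        .map_err(|e| ArrowError::ExternalError(Box::new(e)))?;
    Ok(batches)
}
```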
🤖 Prompt for AI Agents
In `@datafusion/datasource-json/src/utils.rs` around lines 162 - 196, The
JsonArrayToNdjsonReader instances are never validated at the end of consumption,
so malformed arrays slip through; call
JsonArrayToNdjsonReader::validate_complete() on each production path that
constructs and hands the reader off (i.e., the paths that pass the reader into
the Arrow ReaderBuilder and the path that wraps it with ValueIter) and propagate
any std::io::Error returned. Concretely: after the reader has been fully
consumed (or immediately before passing it to ReaderBuilder or constructing
ValueIter), invoke reader.validate_complete()? and return or map the error into
the surrounding Result so missing closing ']', unterminated strings, or trailing
content are reported. Ensure this is added to the buffered JSON array path, the
streaming JSON array path, and the schema inference path that uses ValueIter.
value:useful; category:bug; feedback: The CodeRabbit AI reviewer is correct! Every usage of JsonArrayToNdjsonReader to read JSON array file/stream should be followed by calling validate_complete() to make sure that the content is well formed JSON array and there are not any problems with it. Prevents reading/parsing JSON that is not following the JSON specification.
value:useful; category:bug; feedback: The Claude AI reviewer is correct! Every usage of JsonArrayToNdjsonReader to read JSON array file/stream should be followed by calling validate_complete() to make sure that the content is well formed JSON array and there are not any problems with it. Prevents reading/parsing JSON that is not following the JSON specification.
value:useful; category:bug; feedback: The Claude AI reviewer is correct! Every usage of JsonArrayToNdjsonReader to read JSON array file/stream should be followed by calling validate_complete() to make sure that the content is well formed JSON array and there are not any problems with it. Prevents reading/parsing JSON that is not following the JSON specification.
value:good-to-have; category:bug; feedback: The CodeRabbit AI reviewer is correct! There is some duplicated code in the branches for processing files and streams from the ObjectStore that could be extracted into a helper function to prevent double maintenance and possible diversion
value:good-to-have; category:bug; feedback: The CodeRabbit AI reviewer is correct! By using workspace=true the crate will use the same version of the dependencies as all other crates in the workspace. Prevents using different versions of the log and serde_json crates in one of the project crates
value:good-to-have; category:bug; feedback: The CodeRabbit AI reviewer is correct! Using '/' as a file separator will fail on Windows. It is better to use PathBuf::join() instead, because it knows what to do depending on the runtime OS.


19924: To review by AI