Conversation

@martin-augment
Owner

19924: To review by AI

@gemini-code-assist

Summary of Changes

Hello @martin-augment, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands DataFusion's capabilities for ingesting JSON data by introducing robust support for JSON array format files. By implementing a streaming conversion to newline-delimited JSON, the system can now efficiently process large JSON array datasets without excessive memory consumption. This enhancement provides greater flexibility for users working with diverse JSON data structures.

Highlights

  • JSON Array Format Support: DataFusion now supports reading JSON files in array format (e.g., [{}, {}]) in addition to the existing newline-delimited JSON (NDJSON) format. A new newline_delimited option has been introduced to control this behavior, defaulting to true for NDJSON (a usage sketch follows these highlights).
  • Streaming Conversion for Efficiency: A new JsonArrayToNdjsonReader utility has been implemented. This reader performs a streaming, memory-efficient conversion of JSON array files into NDJSON format on-the-fly, avoiding the need to load the entire file into memory for parsing.
  • Enhanced Schema Inference and Querying: The schema inference mechanism for JSON files has been updated to correctly handle both NDJSON and JSON array formats, including proper record counting. New tests cover various scenarios such as nested structures, null values, unnesting, and compressed JSON array files.
  • Configuration and API Updates: The JsonOptions in datafusion-common, NdJsonReadOptions in datafusion-core, and related protobuf definitions have been extended with the newline_delimited flag. The JsonOpener and JsonSource components have been adapted to utilize this new option, with a specific check to prevent range-based scanning for JSON array formats.
  • SQL Logic Test Integration: New SQL logic tests have been added to validate the functionality of reading JSON array files using the format.newline_delimited='false' option and to confirm that attempting to read them without this option (which defaults to NDJSON) results in an error.
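
As a quick orientation for the highlights above, here is a hedged usage sketch from the DataFrame API. The setter name newline_delimited(false) on NdJsonReadOptions is an assumption for illustration; the PR itself adds a newline_delimited flag that defaults to true, and the SQL logic tests exercise the same switch via format.newline_delimited='false'.

use datafusion::error::Result;
use datafusion::prelude::*;

// Illustrative sketch only: the builder-style setter name is hypothetical.
async fn read_json_array_file(ctx: &SessionContext) -> Result<DataFrame> {
    let options = NdJsonReadOptions::default().newline_delimited(false);
    ctx.read_json("data/array.json", options).await
}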

@coderabbitai

coderabbitai bot commented Jan 26, 2026

Walkthrough

This pull request adds support for reading JSON array format alongside the existing newline-delimited JSON (NDJSON) support. A new newline_delimited configuration option is introduced to JsonOptions with a default value of true. A new JsonArrayToNdjsonReader utility converts JSON arrays into NDJSON through streaming transformation. The JsonFormat, JsonOpener, JsonSource, and NdJsonReadOptions types are updated to propagate the newline_delimited flag. Changes include updates to protobuf definitions for serialization support, new test data files, test cases, and modifications to schema inference logic to handle both JSON formats.

@claude

claude bot commented Jan 26, 2026

Pull Request Review: JSON Array Support (#211)

Summary

This PR adds support for reading JSON array format [{...}, {...}] in addition to the existing newline-delimited JSON (NDJSON) format. The implementation introduces a streaming converter that transforms JSON arrays to NDJSON on-the-fly, avoiding the need to load entire files into memory.

Code Quality and Architecture

✅ Strengths

  1. Excellent streaming implementation - The JsonArrayToNdjsonReader is well-designed (see the state-machine sketch after this list) with:

    • State machine pattern for parsing (Start, InArray, Done)
    • Proper depth tracking for nested structures
    • String escape sequence handling
    • Memory-efficient buffer management (64KB chunks)
    • Implements both Read and BufRead traits correctly
  2. Comprehensive documentation - Great ASCII art diagrams and clear comments explaining:

    • The architecture and data flow
    • Memory efficiency comparison table
    • Transformation rules
    • API usage examples
  3. Good test coverage - Unit tests cover:

    • Simple arrays
    • Nested objects and arrays
    • Strings with special characters
    • Escaped quotes
    • Empty arrays
    • Single elements
    • Whitespace handling
  4. Proper configuration integration - The newline_delimited flag is:

    • Added to JsonOptions config with clear documentation
    • Properly serialized in protobuf
    • Threaded through all layers (format, source, opener)
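
A minimal sketch of the state machine named in point 1, using only the states cited in this review (Start, InArray, Done); the real enum and its transitions in utils.rs handle more (string escapes, commas between elements), so treat this as an outline rather than the PR's code.

// Outline only; mirrors the states named above, not the actual implementation.
#[derive(Debug, PartialEq)]
enum JsonArrayState {
    Start,   // before the opening `[`
    InArray, // streaming elements; depth tracks nesting, top-level separators become newlines
    Done,    // after the matching closing `]`
}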

⚠️ Potential Issues

1. Security/Robustness (datafusion/datasource-json/src/utils.rs)

Issue: Malformed JSON validation only happens after processing completes

The validate_complete() method is defined but never called in the actual processing flow. The reader can reach EOF with unbalanced braces without raising an error. Arrow's JSON reader might catch this, but it's better to fail fast with a clear error.

Recommendation: Call validate_complete() after reading completes in JsonOpener::open() or handle validation during the read process.
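
For reference, a hedged sketch of what that fail-fast check could cover, based on the contract quoted later in this thread (unbalanced braces/brackets, unterminated string, missing closing ]). The field names state, depth, and in_string follow the review comments and may not match the actual struct.

// Sketch only: mirrors validate_complete()'s documented contract, not its real body.
pub fn validate_complete(&self) -> std::io::Result<()> {
    let err = |msg: &str| std::io::Error::new(std::io::ErrorKind::InvalidData, msg);
    if self.in_string {
        return Err(err("unterminated string in JSON array input"));
    }
    if self.depth != 0 {
        return Err(err("unbalanced braces/brackets in JSON array input"));
    }
    if self.state != JsonArrayState::Done {
        return Err(err("missing closing `]` in JSON array input"));
    }
    Ok(())
}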

2. Breaking API Change (datafusion-examples/examples/custom_data_source/csv_json_opener.rs)

Issue: Adding a required boolean parameter to JsonOpener::new()

This breaks backward compatibility for anyone constructing JsonOpener directly. While this is likely an internal API, it should use a builder pattern or a default.

Recommendation: Consider using a builder pattern or with_newline_delimited() method instead of adding required parameters.
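
One possible shape of the non-breaking alternative, as a hedged sketch: keep the existing constructor signature and add a builder-style setter. The newline_delimited field on JsonOpener is assumed here.

// Hypothetical builder-style setter; the default stays NDJSON so existing callers keep compiling.
impl JsonOpener {
    /// When `false`, input is treated as a JSON array (`[{}, {}]`) instead of NDJSON.
    pub fn with_newline_delimited(mut self, newline_delimited: bool) -> Self {
        self.newline_delimited = newline_delimited;
        self
    }
}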

3. Edge Case: Depth Underflow (datafusion/datasource-json/src/utils.rs:199-210)

Issue: Depth can potentially go negative

If the input has a } or ] without a matching opener, depth becomes negative. While that is invalid JSON, it could cause unexpected behavior.

Recommendation: Add validation to prevent negative depth:

b'}' => {
    if self.depth > 0 {
        self.depth -= 1;
    }
    Some(byte)
}

4. Documentation Inconsistency (datafusion/common/src/config.rs:3070)

Issue: Indentation is off for the new config field - the comment uses 7 spaces instead of 8 for alignment.

5. Range-based Scanning Error Handling (datafusion/datasource-json/src/source.rs:216-223)

Issue: Good error message, but the check happens in open() rather than during planning

The error only occurs when trying to execute, not during planning. Users might see this error late in query execution.

Recommendation: Consider adding validation earlier, perhaps in FileSource::create_file_opener() or when creating the scan plan.
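
A hedged fragment of what such a plan-time check might look like; the exact hook (create_file_opener vs. scan construction), the file_groups iteration, and the error text are all assumptions rather than the PR's code.

// Sketch only: reject range-based scans for JSON array format while the scan is
// being built, so the failure surfaces during planning instead of inside open().
if !self.newline_delimited
    && base_config
        .file_groups
        .iter()
        .flatten()
        .any(|file| file.range.is_some())
{
    return plan_err!(
        "Range-based scanning is not supported for JSON array format; \
         set format.newline_delimited to true or disable file repartitioning"
    );
}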

Performance Considerations

✅ Good Decisions

  1. Streaming approach - Avoids loading entire file into memory (important for large files)
  2. 64KB buffer size - Good balance between syscall overhead and memory usage
  3. Character-by-character transformation - Minimal overhead, no additional parsing
  4. Direct integration with Arrow reader - No intermediate allocations

⚠️ Concerns

Stream payload handling (datafusion/datasource-json/src/source.rs:298-322)

  • For stream payloads with JSON arrays, the code still collects all bytes into Vec before processing
  • This defeats the memory efficiency benefit for remote files
  • Impact: Large JSON array files from object stores will still consume significant memory

Suggestion: Consider streaming directly to JsonArrayToNdjsonReader if possible, or document this limitation.
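
If streaming is pursued, one hedged option (not part of this PR) is to bridge the object-store stream to a blocking reader with tokio-util, which the CodeRabbit analysis further down also points at; the dependency choice, feature flags, and error mapping are assumptions.

use tokio_util::io::{StreamReader, SyncIoBridge};

// Hypothetical bridge: Stream<Item = Result<Bytes, E>> -> AsyncRead -> blocking Read.
// The result must be consumed from tokio::task::spawn_blocking, where it could feed
// the compression decoder and JsonArrayToNdjsonReader without collecting the payload.
fn stream_to_sync_read<S, E>(stream: S) -> impl std::io::Read + Send
where
    S: futures::Stream<Item = Result<bytes::Bytes, E>> + Send + Unpin + 'static,
    E: Into<std::io::Error> + Send + 'static,
{
    SyncIoBridge::new(StreamReader::new(stream))
}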

Test Coverage

✅ Good Coverage

  • Unit tests for the streaming converter
  • SQL logic tests for end-to-end functionality
  • Test files for both empty and non-empty arrays
  • Example code updated

⚠️ Missing Tests

  1. Error cases:

    • Malformed JSON arrays (missing closing ], unbalanced braces); see the test sketch after this list
    • Very large JSON arrays (memory usage validation)
    • Mixed format files (starting with [ but actually NDJSON)
  2. Integration tests:

    • Range-based scanning with array format (should fail with clear error)
    • Compressed JSON arrays
    • JSON arrays with different data types
    • Schema inference from JSON arrays
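
As one concrete example of the malformed-array case flagged above, a hypothetical unit test; it assumes the EOF validation discussed elsewhere in this review is wired in, since without it the truncated input may currently be accepted.

use std::io::Read;

#[test]
fn json_array_missing_closing_bracket_is_rejected() {
    // Hypothetical test: the input is truncated before the closing `]`.
    let input: &[u8] = br#"[{"a": 1}, {"a": 2}"#;
    let mut reader = JsonArrayToNdjsonReader::new(input);
    let mut out = String::new();
    let result = reader.read_to_string(&mut out);
    assert!(result.is_err(), "truncated JSON array should fail validation");
}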

Security Concerns

✅ No Critical Issues

The implementation doesn't introduce obvious security vulnerabilities:

  • No unsafe code
  • No command injection risks
  • Proper error handling for I/O operations

⚠️ Minor Concerns

  1. Resource exhaustion - Malformed JSON with unclosed arrays could cause the reader to process data indefinitely until EOF
  2. Memory usage for streams - Stream payloads collect all bytes, which could be a DoS vector for extremely large files

Recommendations Summary

High Priority

  1. Call validate_complete() after JSON array reading completes to catch malformed input early
  2. Add depth underflow protection to prevent negative depth values
  3. Add error handling tests for malformed JSON arrays

Medium Priority

  1. Fix indentation in config.rs documentation
  2. Consider builder pattern for JsonOpener to avoid breaking API changes
  3. Document stream memory limitation for JSON array format
  4. Add validation earlier for unsupported range-based scanning

Low Priority

  1. Add integration tests for error cases and edge conditions
  2. Consider streaming for remote files to improve memory efficiency

Overall Assessment

This is a well-implemented feature with solid architecture and good documentation. The streaming converter is elegant and memory-efficient for file-based access. The main concerns are:

  • Missing validation calls that could mask errors
  • Breaking API change in JsonOpener
  • Memory efficiency limitation for remote stream payloads

With the high-priority issues addressed, this would be ready to merge.

Rating: 7.5/10 - Good implementation with minor issues to address.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request adds support for reading JSON array format to DataFusion, which is a valuable feature enhancement. The implementation is well-structured, introducing a streaming JsonArrayToNdjsonReader for memory-efficient processing of file-based sources. The changes are comprehensive, including configuration updates, core implementation changes, protobuf modifications, and a thorough test suite.

My review focuses on improving correctness and maintainability. I've pointed out a misleading documentation comment about memory usage, a correctness issue regarding the validation of malformed JSON, and a case where garbage data after a JSON array might be ignored. Additionally, I've provided suggestions to reduce code duplication. Overall, this is a solid contribution.

/// - Unbalanced braces/brackets (depth != 0)
/// - Unterminated string
/// - Missing closing `]`
pub fn validate_complete(&self) -> std::io::Result<()> {

high

The validate_complete method is a great addition for ensuring the JSON array is well-formed. However, it's currently unused, which could lead to silently accepting malformed JSON files (e.g., truncated files or files with unbalanced brackets). This is a correctness issue.

This method should be called when the reader is fully consumed. A good way to implement this would be to add a flag to JsonArrayToNdjsonReader to control validation on EOF, since schema inference might not read the whole file.

For example:

  1. Add a validate_on_eof: bool field to JsonArrayToNdjsonReader.
  2. Update new to pub fn new(reader: R, validate_on_eof: bool) -> Self.
  3. In fill_internal_buffer, when EOF is detected (bytes_read == 0), call validate_complete if the flag is set.
    // in fill_internal_buffer
    let bytes_read = self.inner.read(&mut raw_buf)?;
    if bytes_read == 0 {
        if self.validate_on_eof {
            self.validate_complete()?;
        }
        break; // EOF
    }
  4. Update call sites: JsonOpener::open would call JsonArrayToNdjsonReader::new(..., true), while infer_schema_from_json_array would call it with false.

Owner Author

value:useful; category:bug; feedback: The Gemini AI reviewer is correct! The validate_complete() method is not used, and this may lead to accepting invalid JSON as valid. Calling it prevents runtime errors while parsing the JSON that stem from missing validation during pre-processing.

Comment on lines +3084 to +3085
/// Note: JSON array format requires loading the entire file into memory.
/// For large files, newline-delimited format is recommended.

medium

The comment Note: JSON array format requires loading the entire file into memory. is a bit misleading. The implementation uses a streaming converter (JsonArrayToNdjsonReader) for file-based sources, which is memory-efficient. While it's true that for Stream payloads the entire stream is buffered into memory first, for file-based sources (a very common case), it is streaming.

Consider rephrasing to be more precise. For example, you could mention that streaming sources might be buffered. The documentation in datafusion/datasource-json/src/file_format.rs is more accurate on this point: Note: JSON array format is processed using streaming conversion, which is memory-efficient even for large files.

Owner Author

value:useful; category:bug; feedback: The Gemini AI reviewer is correct! The new version of the proposed changes still loads the complete JSON array file into memory for GetResultPayload::Stream, and this may lead to out-of-memory errors for huge files. The file is loaded as a stream from the ObjectStore, so it should be processed as a stream. This prevents out-of-memory failures when loading a big JSON array file.

Comment on lines +286 to 332
     let (schema, records_consumed) = match r.payload {
         #[cfg(not(target_arch = "wasm32"))]
         GetResultPayload::File(file, _) => {
             let decoder = file_compression_type.convert_read(file)?;
-            let mut reader = BufReader::new(decoder);
-            let iter = ValueIter::new(&mut reader, None);
-            infer_json_schema_from_iterator(iter.take_while(|_| take_while()))?
+            let reader = BufReader::new(decoder);
+
+            if newline_delimited {
+                // NDJSON: use ValueIter directly
+                let iter = ValueIter::new(reader, None);
+                let mut count = 0;
+                let schema =
+                    infer_json_schema_from_iterator(iter.take_while(|_| {
+                        let should_take = count < records_to_read;
+                        if should_take {
+                            count += 1;
+                        }
+                        should_take
+                    }))?;
+                (schema, count)
+            } else {
+                // JSON array format: use streaming converter
+                infer_schema_from_json_array(reader, records_to_read)?
+            }
         }
         GetResultPayload::Stream(_) => {
             let data = r.bytes().await?;
             let decoder = file_compression_type.convert_read(data.reader())?;
-            let mut reader = BufReader::new(decoder);
-            let iter = ValueIter::new(&mut reader, None);
-            infer_json_schema_from_iterator(iter.take_while(|_| take_while()))?
+            let reader = BufReader::new(decoder);
+
+            if newline_delimited {
+                let iter = ValueIter::new(reader, None);
+                let mut count = 0;
+                let schema =
+                    infer_json_schema_from_iterator(iter.take_while(|_| {
+                        let should_take = count < records_to_read;
+                        if should_take {
+                            count += 1;
+                        }
+                        should_take
+                    }))?;
+                (schema, count)
+            } else {
+                // JSON array format: use streaming converter
+                infer_schema_from_json_array(reader, records_to_read)?
+            }
         }
     };

medium

There's significant code duplication between the GetResultPayload::File and GetResultPayload::Stream match arms for handling newline_delimited. You could extract the logic for choosing between NDJSON and JSON array schema inference into a helper function to reduce duplication and improve maintainability.

A new function infer_schema_from_ndjson could be created, similar to infer_schema_from_json_array.

fn infer_schema_from_ndjson<R: Read>(
    reader: R,
    max_records: usize,
) -> Result<(Schema, usize)> {
    let iter = ValueIter::new(reader, None);
    let mut count = 0;
    let schema = infer_json_schema_from_iterator(iter.take_while(|_| {
        let should_take = count < max_records;
        if should_take {
            count += 1;
        }
        should_take
    }))?;
    Ok((schema, count))
}

Then the duplicated block could be simplified to:

let reader = BufReader::new(decoder);
if newline_delimited {
    infer_schema_from_ndjson(reader, records_to_read)?
} else {
    infer_schema_from_json_array(reader, records_to_read)?
}

Owner Author

value:good-to-have; category:bug; feedback: The Gemini AI reviewer is correct! There is code duplication in the blocks for File and Stream inputs, which would be good to extract into a helper function and reuse. This prevents double maintenance and bugs being fixed in only one of the places.

Comment on lines +260 to +281
if newline_delimited {
    // NDJSON: use BufReader directly
    let reader = BufReader::new(bytes);
    let arrow_reader = ReaderBuilder::new(schema)
        .with_batch_size(batch_size)
        .build(reader)?;

    Ok(futures::stream::iter(arrow_reader)
        .map(|r| r.map_err(Into::into))
        .boxed())
} else {
    // JSON array format: wrap with streaming converter
    // JsonArrayToNdjsonReader implements BufRead
    let ndjson_reader = JsonArrayToNdjsonReader::new(bytes);
    let arrow_reader = ReaderBuilder::new(schema)
        .with_batch_size(batch_size)
        .build(ndjson_reader)?;

    Ok(futures::stream::iter(arrow_reader)
        .map(|r| r.map_err(Into::into))
        .boxed())
}

medium

There is some code duplication in the GetResultPayload::File match arm for handling newline_delimited. You can refactor this to make it more concise.

Since JsonArrayToNdjsonReader implements BufRead, you could refactor this section to avoid repeating the ReaderBuilder and stream creation logic.

                    let reader: Box<dyn std::io::BufRead + Send> = if newline_delimited {
                        Box::new(std::io::BufReader::new(bytes))
                    } else {
                        Box::new(JsonArrayToNdjsonReader::new(bytes))
                    };

                    let arrow_reader = ReaderBuilder::new(schema)
                        .with_batch_size(batch_size)
                        .build(reader)?;

                    Ok(futures::stream::iter(arrow_reader)
                        .map(|r| r.map_err(Into::into))
                        .boxed())

Owner Author

value:good-to-have; category:bug; feedback: The Gemini AI reviewer is correct! There is code duplication in the blocks for File and Stream inputs, which would be good to extract into a helper function and reuse. This prevents double maintenance and bugs being fixed in only one of the places.

}
}
}
JsonArrayState::Done => None,

medium

The current implementation of JsonArrayToNdjsonReader stops processing when it encounters the closing ]. Any characters (including non-whitespace garbage) after the closing bracket will be ignored. This could lead to silently accepting invalid JSON files.

Consider modifying the state machine to handle this. For example, after reaching the Done state, any subsequent non-whitespace characters should result in an InvalidData error.

You could modify fill_internal_buffer to check for this:

// in fill_internal_buffer
for &byte in &raw_buf[..bytes_read] {
    if self.state == JsonArrayState::Done {
        if !byte.is_ascii_whitespace() {
            return Err(std::io::Error::new(
                std::io::ErrorKind::InvalidData,
                "Found garbage after JSON array.",
            ));
        }
        continue;
    }
    if let Some(transformed) = self.process_byte(byte)
    // ...

This will make the parser stricter and more correct.

@augmentcode

augmentcode bot commented Jan 26, 2026

🤖 Augment PR Summary

Summary: Adds support for reading JSON data in either newline-delimited JSON (NDJSON) or JSON array form.

Changes:

  • Introduces a new newline_delimited option on JSON read configuration (defaults to NDJSON).
  • Extends NdJsonReadOptions and JSON session/table options to propagate the format choice.
  • Implements JSON-array reading by converting [{...}, {...}] to NDJSON on-the-fly via JsonArrayToNdjsonReader.
  • Updates schema inference to handle both NDJSON and JSON array inputs while tracking consumed records.
  • Guards against range-based scans when using JSON array format (not supported).

Tests: Adds unit tests and SQLLogicTests plus new JSON fixtures for array/empty-array cases.

@augmentcode augmentcode bot left a comment

Review completed. 3 suggestions posted.

CompressionTypeVariant compression = 1; // Compression type
optional uint64 schema_infer_max_rec = 2; // Optional max records for schema inference
optional uint32 compression_level = 3; // Optional compression level
bool newline_delimited = 4; // Whether to read as newline-delimited JSON (default true). When false, expects JSON array format [{},...]

newline_delimited is added as a non-optional bool, which will deserialize as false when reading older protobuf payloads that don’t have this field, effectively flipping the default from NDJSON to array format for old plans/configs.

}

let bytes_read = self.inner.read(&mut raw_buf)?;
if bytes_read == 0 {

JsonArrayToNdjsonReader exposes validate_complete(), but EOF currently just returns Ok(0) without checking state/depth/in_string, so a malformed JSON array (e.g. missing the closing ]) can be silently accepted and partially read as NDJSON.

Owner Author

value:useful; category:bug; feedback: The Augment AI reviewer is correct! The validate_complete() method is not used, and this may lead to accepting invalid JSON as valid. Calling it prevents runtime errors while parsing the JSON that stem from missing validation during pre-processing.

/// ]
/// ```
///
/// Note: JSON array format requires loading the entire file into memory.

The docs state JSON array format "requires loading the entire file into memory", but the implementation uses streaming conversion for GetResultPayload::File paths; this comment may be misleading for users trying to choose between formats.

Owner Author

value:useful; category:bug; feedback: The Augment AI reviewer is correct! The new version of the proposed changes still loads the complete JSON array file into memory for GetResultPayload::Stream, and this may lead to out-of-memory errors for huge files. The file is loaded as a stream from the ObjectStore, so it should be processed as a stream. This prevents out-of-memory failures when loading a big JSON array file.

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

compression: compression.into(),
compression_level: proto_opts.compression_level,
schema_infer_max_rec: proto_opts.schema_infer_max_rec.map(|h| h as usize),
newline_delimited: proto_opts.newline_delimited,

Proto deserialization defaults break backward compatibility for JSON format

Medium Severity

The newline_delimited field in the proto JsonOptions uses bool (not optional bool), which defaults to false in proto3. When deserializing old proto messages that lack this field, the value becomes false (JSON array format) instead of the expected true (NDJSON). The Rust JsonOptions correctly defaults to true, and the pbjson deserializer handles this with unwrap_or(true), but the prost binary deserialization paths directly use proto_opts.newline_delimited without any default handling. Additionally, JsonOptionsProto::default() returns newline_delimited = false. This breaks backward compatibility and could cause NDJSON files to be incorrectly read as JSON array format.

Additional Locations (2)

}
if self.newline_delimited {
struct_ser.serialize_field("newlineDelimited", &self.newline_delimited)?;
}

Pbjson serialization/deserialization asymmetry loses false value

Medium Severity

The pbjson serialization for newline_delimited only includes the field when the value is true (line 4609: if self.newline_delimited { serialize... }), but deserialization defaults missing fields to true (line 4721: unwrap_or(true)). This causes a roundtrip bug: serializing JsonOptions with newline_delimited = false omits the field, and deserializing the result gives newline_delimited = true. The serialization condition is inverted from what's needed to preserve the non-default value.
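
A hedged sketch of the symmetric condition Bugbot describes; this file is likely pbjson-generated, so the real fix would go through the generator or a custom serde impl, and the snippet below only mirrors the quoted code.

// Sketch: emit the field whenever it differs from the deserialization default
// (`true`), so a `false` value survives a serialize/deserialize round trip.
if !self.newline_delimited {
    struct_ser.serialize_field("newlineDelimited", &self.newline_delimited)?;
}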

Additional Locations (1)

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
datafusion/proto-common/src/from_proto/mod.rs (1)

1097-1109: Default missing newline_delimited to true for backward compatibility.

The newline_delimited field in the proto is currently a scalar bool, which defaults to false when absent. This breaks backward compatibility—older serialized JsonOptions without this field will incorrectly deserialize as false instead of the intended default of true.

Making the proto field optional and defaulting to true in deserialization is the correct approach, already implemented in the JSON serialization path. The from_proto conversion should follow the same pattern.

🔧 Proposed fix (after making proto field optional)
-            newline_delimited: proto_opts.newline_delimited,
+            newline_delimited: proto_opts.newline_delimited.unwrap_or(true),
datafusion/proto-common/proto/datafusion_common.proto (1)

468-473: Make newline_delimited field optional to preserve backward compatibility with binary serialization.

Proto3 scalar bools default to false when missing from binary-encoded messages, which breaks backward compatibility: older serialized JsonOptions (from before this field existed) will deserialize as newline_delimited = false instead of the intended default true. Change the field to optional bool to allow proper default handling in deserialization logic.

🔧 Proposed proto change
-  bool newline_delimited = 4; // Whether to read as newline-delimited JSON (default true). When false, expects JSON array format [{},...]
+  optional bool newline_delimited = 4; // Whether to read as newline-delimited JSON (default true). When false, expects JSON array format [{},...]

Update the from_proto conversion to treat None as true:

newline_delimited: proto_opts.newline_delimited.unwrap_or(true),
🤖 Fix all issues with AI agents
In `@datafusion/common/src/config.rs`:
- Around line 3068-3086: The doc comment for the config field newline_delimited
misleadingly states that JSON array format "requires loading the entire file
into memory"; update the comment to reflect actual behavior by softening this
claim and referencing JsonArrayToNdjsonReader: change the note to say JSON array
format "can be more memory‑intensive than NDJSON and may require buffering
depending on the reader implementation (e.g., JsonArrayToNdjsonReader streams
arrays to NDJSON to reduce memory use)" or similar, so readers understand that
streaming mitigates the memory requirement while NDJSON is generally more
memory‑efficient for very large files; edit the comment attached to pub
newline_delimited: bool, default = true accordingly.

In `@datafusion/datasource-json/src/source.rs`:
- Around line 298-314: The current stream path unconditionally collects the
entire stream via the try_fold into a Vec<u8>, risking OOM; modify the
accumulation logic used before calling file_compression_type.convert_read so
that it enforces a configurable maximum size (e.g., MAX_JSON_ARRAY_BYTES) and
returns a clear DataFusionError if exceeded, or alternatively, when the source
indicates a file-like stream, write the incoming chunks to a temporary file and
pass that file into file_compression_type.convert_read/JsonArrayToNdjsonReader
to avoid in-memory buffering; update the code surrounding the
s.map_err(...).try_fold(...) and the subsequent use of JsonArrayToNdjsonReader
and file_compression_type.convert_read to implement the cap-or-spill-to-disk
behavior and surface a clear error message when the limit is hit.

In `@datafusion/datasource-json/src/utils.rs`:
- Around line 238-262: In fill_internal_buffer, avoid dropping transformed bytes
by limiting each read to the remaining capacity (use remaining =
self.buffer.len() - write_pos and read at most min(DEFAULT_BUF_SIZE, remaining)
into raw_buf) instead of always reading DEFAULT_BUF_SIZE, and after a read that
returns 0 (EOF) validate the JSON array state (if self.state !=
JsonArrayState::Done return an error/Err indicating unexpected EOF) so malformed
arrays fail fast; update the loop around self.inner.read, the for byte in
&raw_buf[..bytes_read] processing, and any early breaks in process_byte to
respect write_pos and the JsonArrayState::Done check.
🧹 Nitpick comments (2)
datafusion/datasource-json/src/mod.rs (1)

26-29: Consider whether JsonArrayToNdjsonReader should be part of the public API.

The re-export at line 29 makes JsonArrayToNdjsonReader part of the crate's public API surface. This is appropriate if external users may need to use this utility for custom JSON array processing. However, if it's primarily an internal implementation detail for the JSON format handling, consider keeping it internal (removing the pub use re-export while keeping pub mod utils for internal crate access).

If intentionally public, this is a clean API addition that follows the existing re-export pattern in this file.

datafusion-examples/examples/custom_data_source/csv_json_opener.rs (1)

123-129: Prefer a named flag for readability in the example.

The trailing true is a magic boolean; consider binding it to a named variable or adding a short comment.

💡 Suggested tweak
-    let opener = JsonOpener::new(
-        8192,
-        projected,
-        FileCompressionType::UNCOMPRESSED,
-        Arc::new(object_store),
-        true,
-    );
+    let newline_delimited = true;
+    let opener = JsonOpener::new(
+        8192,
+        projected,
+        FileCompressionType::UNCOMPRESSED,
+        Arc::new(object_store),
+        newline_delimited,
+    );

Comment on lines +3068 to +3086
/// The JSON format to use when reading files.
///
/// When `true` (default), expects newline-delimited JSON (NDJSON):
/// ```text
/// {"key1": 1, "key2": "val"}
/// {"key1": 2, "key2": "vals"}
/// ```
///
/// When `false`, expects JSON array format:
/// ```text
/// [
/// {"key1": 1, "key2": "val"},
/// {"key1": 2, "key2": "vals"}
/// ]
/// ```
///
/// Note: JSON array format requires loading the entire file into memory.
/// For large files, newline-delimited format is recommended.
pub newline_delimited: bool, default = true

⚠️ Potential issue | 🟡 Minor

Clarify the JSON array memory note.

The comment says JSON array format requires loading the entire file into memory. If the new JsonArrayToNdjsonReader streams, this wording is misleading. Consider softening to “may be more memory‑intensive than NDJSON” or align it with the actual behavior.

✏️ Suggested doc tweak
-        /// Note: JSON array format requires loading the entire file into memory.
-        /// For large files, newline-delimited format is recommended.
+        /// Note: JSON array parsing can be more memory-intensive than NDJSON.
+        /// For large files, NDJSON is recommended.

Owner Author

value:useful; category:bug; feedback: The CodeRabbit AI reviewer is correct! The new version of the proposed changes still loads the complete JSON array file into memory, and this may lead to out-of-memory errors for huge files. The file is loaded as a stream from the ObjectStore, so it should be processed as a stream. This prevents out-of-memory failures when loading a big JSON array file.

Comment on lines +298 to +314
// JSON array format from stream: collect bytes first, then use streaming converter
// Note: We still need to collect for streams, but the converter avoids
// additional memory overhead from serde_json parsing
let bytes = s
    .map_err(DataFusionError::from)
    .try_fold(Vec::new(), |mut acc, chunk| async move {
        acc.extend_from_slice(&chunk);
        Ok(acc)
    })
    .await?;

let decompressed = file_compression_type
    .convert_read(std::io::Cursor::new(bytes))?;

// Use streaming converter - it implements BufRead
let ndjson_reader = JsonArrayToNdjsonReader::new(decompressed);


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain (condensed from the bot's script runs and web queries)

The verification scripts inspected datafusion/datasource-json/Cargo.toml and its dependency list, the JsonArrayToNdjsonReader implementation, the GetResultPayload::File/Stream handling around lines 280-330 of source.rs, the convert_read/convert_stream implementations and their call sites, and existing spawn_blocking usage, and searched for tokio-util in workspace Cargo.toml files.

Web queries confirmed that tokio_util::io (tokio-util 0.7.x, including 0.7.17) provides StreamReader, which adapts a Stream<Item = Result<B, E>> (with B: bytes::Buf and E: Into<std::io::Error>) into an AsyncRead/AsyncBufRead, and SyncIoBridge (behind the io-util feature), which wraps an AsyncRead/AsyncWrite as a blocking std::io::Read/Write and must be driven from a blocking thread such as tokio::task::spawn_blocking. Composed as SyncIoBridge::new(StreamReader::new(stream)) inside spawn_blocking, they turn a byte stream into a synchronous reader without collecting it first.

Avoid unbounded memory buffering for JSON array streams.

The stream path collects the entire file into Vec<u8>, which can grow without bounds and cause OOM for large arrays. This undermines the streaming design of JsonArrayToNdjsonReader.

Instead: add an explicit size cap and return a clear error if exceeded, or prefer the file-based code path when the stream represents a file. If neither is practical, document the limitation and memory trade-off clearly.
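
A hedged fragment of the size-cap option; MAX_JSON_ARRAY_BYTES, its value, and the use of ResourcesExhausted are hypothetical and would need to be tied to a real configuration setting.

// Hypothetical cap on how much of a JSON array stream may be buffered in memory.
const MAX_JSON_ARRAY_BYTES: usize = 256 * 1024 * 1024;

let bytes = s
    .map_err(DataFusionError::from)
    .try_fold(Vec::new(), |mut acc, chunk| async move {
        if acc.len() + chunk.len() > MAX_JSON_ARRAY_BYTES {
            return Err(DataFusionError::ResourcesExhausted(format!(
                "JSON array stream exceeded {MAX_JSON_ARRAY_BYTES} bytes; \
                 use newline-delimited JSON for inputs this large"
            )));
        }
        acc.extend_from_slice(&chunk);
        Ok(acc)
    })
    .await?;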


Owner Author

value:useful; category:bug; feedback: The CodeRabbit AI reviewer is correct! The new version of the proposed changes still loads the complete JSON array file into memory, and this may lead to out-of-memory errors for huge files. The file is loaded as a stream from the ObjectStore, so it should be processed as a stream. This prevents out-of-memory failures when loading a big JSON array file.

Comment on lines +238 to +262
// Read raw data from inner reader
let mut raw_buf = vec![0u8; DEFAULT_BUF_SIZE];
let mut write_pos = 0;

loop {
    if write_pos >= self.buffer.len() {
        break;
    }

    let bytes_read = self.inner.read(&mut raw_buf)?;
    if bytes_read == 0 {
        break; // EOF
    }

    for &byte in &raw_buf[..bytes_read] {
        if self.state == JsonArrayState::Done {
            break;
        }
        if let Some(transformed) = self.process_byte(byte)
            && write_pos < self.buffer.len()
        {
            self.buffer[write_pos] = transformed;
            write_pos += 1;
        }
    }

⚠️ Potential issue | 🔴 Critical

Prevent dropped bytes and enforce EOF validation in fill_internal_buffer.

Reading a full raw chunk while the output buffer has limited remaining space can drop transformed bytes, and EOF before a closing ] currently returns Ok(0) (silently accepting malformed or misformatted input). Limit reads to the remaining capacity and validate on EOF so invalid arrays fail fast.

🐛 Proposed fix
-            let bytes_read = self.inner.read(&mut raw_buf)?;
-            if bytes_read == 0 {
-                break; // EOF
-            }
+            let remaining = self.buffer.len() - write_pos;
+            let bytes_read = self.inner.read(&mut raw_buf[..remaining])?;
+            if bytes_read == 0 {
+                // Only validate when we have no buffered output left to emit
+                if write_pos == 0 {
+                    self.validate_complete()?;
+                }
+                break; // EOF
+            }

Owner Author

value:useful; category:bug; feedback: The CodeRabbit AI reviewer is correct! The validate_complete() method is not used, and this may lead to accepting invalid JSON as valid. Calling it prevents runtime errors while parsing the JSON that stem from missing validation during pre-processing.

@martin-augment
Owner Author

1. Security/Robustness (datafusion/datasource-json/src/utils.rs)

Issue: Malformed JSON validation only happens after processing completes

The validate_complete() method is defined but never called in the actual processing flow. The reader can reach EOF with unbalanced braces without raising an error. Arrow's JSON reader might catch this, but it's better to fail fast with a clear error.

Recommendation: Call validate_complete() after reading completes in JsonOpener::open() or handle validation during the read process.

value:useful; category:bug; feedback: The Claude AI reviewer is correct! The validate_complete() method is not used, and this may lead to accepting invalid JSON as valid. Calling it prevents runtime errors while parsing the JSON that stem from missing validation during pre-processing.
