
19924: Support JSON arrays reader/parse for datafusion #205

Closed
martin-augment wants to merge 1 commit into main from pr-19924-2026-01-23-07-55-14

Conversation

@martin-augment
Owner

19924: To review by AI

@gemini-code-assist

Summary of Changes

Hello @martin-augment, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly extends DataFusion's capabilities for reading JSON data by introducing support for JSON files structured as a single array of objects. Previously, DataFusion primarily handled newline-delimited JSON. With the new format_array option, users can now seamlessly ingest both common JSON formats, enhancing flexibility and compatibility with various data sources. The implementation includes updates to schema inference, data reading, and serialization/deserialization logic, ensuring robust handling of the new format, along with thorough testing.

Highlights

  • JSON Array Format Support: DataFusion can now read JSON files where the entire content is a single JSON array of objects, in addition to the existing newline-delimited JSON (NDJSON) format.
  • format_array Option: A new boolean option, format_array, has been introduced in JsonOptions and NdJsonReadOptions to specify whether the input JSON is in array format. It defaults to false (NDJSON); a usage sketch follows this list.
  • Schema Inference and Data Reading Logic: The internal logic for inferring schemas and reading data from JSON files has been updated to conditionally handle both NDJSON and array JSON formats, leveraging serde_json for array parsing.
  • New Test Cases: Comprehensive unit and SQL logic tests have been added to validate the new functionality, covering schema inference, data reading, projections, empty arrays, and interaction with compression.
  • serde_json Dependency: The datafusion-datasource-json crate now includes serde_json to facilitate parsing of JSON array structures.
  • Limitation on Range Scanning: A known limitation is that JSON array format does not support range-based file scanning, meaning repartition_file_scans must be disabled when using this format.
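
The sketch below shows, in a hedged way, how the new option might be exercised from Rust: it assumes the format_array field added by this PR together with DataFusion's existing read_json API, uses an illustrative file path, and switches off repartition_file_scans to respect the range-scanning limitation noted above.

use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // Array-format JSON cannot be range-scanned, so turn file-scan
    // repartitioning off for this session (see the limitation above).
    let config = SessionConfig::new().with_repartition_file_scans(false);
    let ctx = SessionContext::new_with_config(config);

    // format_array is the new field added by this PR; everything else stays at its default.
    let options = NdJsonReadOptions {
        format_array: true,
        ..Default::default()
    };

    // "data/events_array.json" is a placeholder path containing [{...}, {...}].
    let df = ctx.read_json("data/events_array.json", options).await?;
    df.show().await?;
    Ok(())
}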

@coderabbitai

coderabbitai bot commented Jan 23, 2026

Walkthrough

This pull request adds support for JSON array format handling alongside the existing line-delimited NDJSON format in DataFusion. A new format_array boolean field is introduced to JsonOptions configuration and propagated through NdJsonReadOptions, JsonFormat, JsonSource, and JsonOpener. Array-format JSON data is parsed via a new read_json_array_to_batches helper function. The protobuf serialization layer and logical plan file format handling are updated to serialize and deserialize the new option. Test data files and comprehensive test cases covering schema inference, empty arrays, data reading, projection, and compression scenarios are included. SQL logic tests validate the feature's integration.


@claude

claude bot commented Jan 23, 2026

PR Review: Support JSON Arrays Reader/Parse for DataFusion

Overview

This PR adds support for reading JSON files in array format in addition to the existing line-delimited (NDJSON) format.

Overall Assessment: Good implementation with comprehensive testing. A few concerns need to be addressed before merging.


Code Quality and Best Practices

Strengths:

  • Well-structured code with clear separation of concerns
  • Excellent documentation and doc comments
  • Consistent API design with builder pattern
  • Proper propagation through config layers

Issues:

  1. Breaking API Change (csv_json_opener.rs:125)

    • Adding a required boolean parameter to JsonOpener::new() is a breaking change
    • Consider a builder pattern or a default constructor for backward compatibility (see the sketch after this list)
  2. Code Duplication (file_format.rs and source.rs)

    • Functions infer_json_schema_from_json_array and read_json_array_to_batches duplicate logic
    • Consider extracting shared code
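
One non-authoritative way to address issue 1 (the constructor break) is sketched below. The struct is a reduced stand-in with most of JsonOpener's real fields and constructor arguments elided; the point is only the pattern of keeping the old new() signature intact and exposing the new flag through a builder-style method.

use std::sync::Arc;
use object_store::ObjectStore;

/// Reduced, illustrative stand-in for `JsonOpener` (most real fields elided).
pub struct JsonOpener {
    batch_size: usize,
    object_store: Arc<dyn ObjectStore>,
    format_array: bool,
}

impl JsonOpener {
    /// The pre-existing constructor keeps its shape; array mode defaults to off (NDJSON),
    /// so current callers continue to compile unchanged.
    pub fn new(batch_size: usize, object_store: Arc<dyn ObjectStore>) -> Self {
        Self {
            batch_size,
            object_store,
            format_array: false,
        }
    }

    /// Builder-style opt-in for the new JSON-array mode.
    pub fn with_format_array(mut self, format_array: bool) -> Self {
        self.format_array = format_array;
        self
    }
}

A caller that needs array mode would then write JsonOpener::new(...).with_format_array(true), while existing call sites such as the one in csv_json_opener.rs would not need to change.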

Potential Bugs and Issues

  1. Memory Concern - Reading Entire File

    • Loads entire JSON file into memory twice
    • Could cause OOM errors for large files
    • Recommendation: Document limitation or add size guard
  2. Empty Array Handling Inconsistency

    • Schema inference errors out on empty arrays, while reading returns an empty batch
    • The two behaviors should be made consistent

Performance Considerations

  1. Critical: Inefficient Double Conversion (source.rs:763-767)

    • Parses the JSON, serializes it back to NDJSON, then parses it again
    • Very inefficient for large files
    • Recommendation: Use arrow-json directly on the parsed values (sketched after this list)
  2. Streaming Breaks for Array Format

    • Collects all bytes before processing, defeating the purpose of streaming
    • Document this as a limitation
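
A hedged sketch of the recommendation in item 1: arrow-json's Decoder accepts anything implementing serde::Serialize, so the already-parsed serde_json::Values can be fed to it directly, skipping the NDJSON round-trip. The function name and signature here are illustrative, not the PR's actual helper.

use arrow::datatypes::SchemaRef;
use arrow::error::ArrowError;
use arrow::json::ReaderBuilder;
use arrow::record_batch::RecordBatch;

/// Illustrative replacement for the NDJSON round-trip: decode parsed JSON values directly.
fn values_to_batches(
    values: Vec<serde_json::Value>,
    schema: SchemaRef,
    batch_size: usize,
) -> Result<Vec<RecordBatch>, ArrowError> {
    let mut decoder = ReaderBuilder::new(schema)
        .with_batch_size(batch_size)
        .build_decoder()?;
    let mut batches = Vec::new();
    for chunk in values.chunks(batch_size) {
        // serde_json::Value implements Serialize, so no re-serialization to text is needed.
        decoder.serialize(chunk)?;
        if let Some(batch) = decoder.flush()? {
            batches.push(batch);
        }
    }
    Ok(batches)
}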

Security Concerns

  1. Potential DoS via Large JSON Arrays
    • No size limits on JSON array parsing
    • Multi-GB files could exhaust resources
    • Recommendation: Add a maximum file size or array element count limit (a minimal guard is sketched below)
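
A minimal guard illustrating the recommendation, assuming a hypothetical max_bytes limit sourced from configuration; no such option exists in this PR.

use datafusion_common::DataFusionError;

/// Hypothetical size guard; `max_bytes` would come from a new, not-yet-existing config option.
fn check_json_array_size(content_len: usize, max_bytes: usize) -> Result<(), DataFusionError> {
    if content_len > max_bytes {
        return Err(DataFusionError::ResourcesExhausted(format!(
            "JSON array file is {content_len} bytes, exceeding the configured limit of {max_bytes} bytes"
        )));
    }
    Ok(())
}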

Test Coverage

Strengths:

  • Excellent coverage with 8 new test functions
  • Tests cover: basic reading, empty arrays, schema inference, projection, compression, SQL integration

Suggestions:

  1. Test for very large JSON arrays
  2. Test for malformed JSON (sketched below)
  3. Test for mixed types
  4. Benchmark tests
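
Suggestion 2 (malformed JSON) might look roughly like the sketch below. It is a hedged example that assumes the format_array field from this PR, writes a deliberately truncated array to a temporary file, and accepts the error surfacing either at schema inference or at collection time.

use datafusion::prelude::*;
use std::io::Write;

#[tokio::test]
async fn reading_malformed_json_array_returns_error() {
    let dir = std::env::temp_dir().join("df_json_array_malformed_test");
    std::fs::create_dir_all(&dir).unwrap();
    let path = dir.join("broken.json");
    // Deliberately truncated array: missing the closing bracket.
    std::fs::File::create(&path)
        .unwrap()
        .write_all(b"[{\"a\": 1}, {\"a\": 2}")
        .unwrap();

    let ctx = SessionContext::new();
    let options = NdJsonReadOptions {
        format_array: true,
        ..Default::default()
    };
    // The parse error may surface during schema inference (read_json) or when collecting.
    match ctx.read_json(path.to_str().unwrap(), options).await {
        Ok(df) => assert!(df.collect().await.is_err()),
        Err(_) => {}
    }
}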

Recommendation

Must Fix:

  1. Fix breaking API change in JsonOpener::new()
  2. Add size limits to prevent DoS/OOM
  3. Resolve empty array handling inconsistency
  4. Document memory limitations

Should Fix:
5. Improve performance of double JSON conversion
6. Reduce code duplication

Nice to Have:
7. Additional error tests
8. Performance benchmarks


Great work overall! Main concerns are memory safety and performance for large files.

@augmentcode

augmentcode bot commented Jan 23, 2026

🤖 Augment PR Summary

Summary: This PR adds first-class support for reading JSON files whose top-level value is an array (e.g. [{...}, {...}]) in addition to newline-delimited JSON (NDJSON).

Changes:

  • Introduced a new format_array option under JSON table options/config, with documentation and defaults.
  • Extended protobuf serialization/deserialization of JsonOptions to include format_array.
  • Added format_array to NdJsonReadOptions and wired it into JsonFormat construction.
  • Enhanced JsonFormat schema inference to handle JSON arrays using serde_json.
  • Updated JsonSource/JsonOpener to read array-format JSON by loading the full file and converting it to NDJSON, which Arrow’s JSON reader then turns into record batches.
  • Added unit tests covering array schema inference, limiting, scanning, projection, and compressed inputs.
  • Added SQLLogicTest coverage plus new JSON test data files.
  • Adjusted the custom data source example for the updated JSON opener signature.

Technical Notes: Array mode disables range-based scanning and currently requires full-file reads to parse/convert the array into record batches.



@augmentcode augmentcode bot left a comment


Review completed. 2 suggestions posted.


infer_json_schema_from_iterator(iter.take_while(|_| take_while()))?

if is_array_format {
infer_json_schema_from_json_array(&mut reader, records_to_read)?

In infer_schema, records_to_read is only decremented via the NDJSON take_while() path, so in JSON-array mode it never decreases and the schema_infer_max_rec limit / early break won’t apply across multiple files. This can cause schema inference to scan more records/files than requested and potentially infer additional fields unexpectedly.


pub file_sort_order: Vec<Vec<SortExpr>>,
/// Whether the JSON file is in array format `[{...}, {...}]` instead of
/// line-delimited format. Defaults to `false`.
pub format_array: bool,

Adding pub format_array to the public NdJsonReadOptions struct is a semver-breaking change for downstream users that construct it with a struct literal. Consider whether NdJsonReadOptions should be #[non_exhaustive] (or otherwise avoid adding new public fields) to preserve forward compatibility.
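
For context, the #[non_exhaustive] idea looks like the sketch below on a reduced, illustrative struct (the name and fields are hypothetical, not the real NdJsonReadOptions definition): downstream crates can no longer use bare struct literals, so later field additions such as format_array stop being breaking changes.

/// Illustrative options struct, not the actual DataFusion definition.
#[non_exhaustive]
#[derive(Debug, Clone, Default)]
pub struct JsonScanOptions {
    pub schema_infer_max_records: usize,
    pub format_array: bool,
}

impl JsonScanOptions {
    pub fn new() -> Self {
        Self::default()
    }

    pub fn with_format_array(mut self, format_array: bool) -> Self {
        self.format_array = format_array;
        self
    }
}

// Downstream code then constructs the options through the API instead of a struct
// literal, e.g. JsonScanOptions::new().with_format_array(true), so new fields don't break it.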



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
datafusion/datasource-json/src/file_format.rs (1)

200-304: Schema inference limit isn’t enforced across files for array format.
In the array branch, records_to_read never decreases, so schema_infer_max_rec is applied per file instead of globally. This diverges from the line-delimited behavior and can over-read schemas when multiple files are provided.

🔧 Proposed fix (track how many array elements were consumed)
-fn infer_json_schema_from_json_array<R: Read>(
-    reader: &mut R,
-    max_records: usize,
-) -> std::result::Result<Schema, ArrowError> {
+fn infer_json_schema_from_json_array<R: Read>(
+    reader: &mut R,
+    max_records: usize,
+) -> std::result::Result<(Schema, usize), ArrowError> {
     let mut content = String::new();
     reader.read_to_string(&mut content).map_err(|e| {
         ArrowError::JsonError(format!("Failed to read JSON content: {e}"))
     })?;
 
     // Parse as JSON array using serde_json
     let values: Vec<serde_json::Value> = serde_json::from_str(&content)
         .map_err(|e| ArrowError::JsonError(format!("Failed to parse JSON array: {e}")))?;
 
     // Take only max_records for schema inference
-    let values_to_infer: Vec<_> = values.into_iter().take(max_records).collect();
+    let take = values.len().min(max_records);
+    let values_to_infer: Vec<_> = values.into_iter().take(take).collect();
 
-    if values_to_infer.is_empty() {
+    if take == 0 {
         return Err(ArrowError::JsonError(
             "JSON array is empty, cannot infer schema".to_string(),
         ));
     }
 
     // Use arrow's schema inference on the parsed values
-    infer_json_schema_from_iterator(values_to_infer.into_iter().map(Ok))
+    let schema = infer_json_schema_from_iterator(values_to_infer.into_iter().map(Ok))?;
+    Ok((schema, take))
 }
@@
-                    if is_array_format {
-                        infer_json_schema_from_json_array(&mut reader, records_to_read)?
-                    } else {
+                    if is_array_format {
+                        let (schema, used) =
+                            infer_json_schema_from_json_array(&mut reader, records_to_read)?;
+                        records_to_read = records_to_read.saturating_sub(used);
+                        schema
+                    } else {
                         let iter = ValueIter::new(&mut reader, None);
                         infer_json_schema_from_iterator(
                             iter.take_while(|_| take_while()),
                         )?
                     }
@@
-                    if is_array_format {
-                        infer_json_schema_from_json_array(&mut reader, records_to_read)?
-                    } else {
+                    if is_array_format {
+                        let (schema, used) =
+                            infer_json_schema_from_json_array(&mut reader, records_to_read)?;
+                        records_to_read = records_to_read.saturating_sub(used);
+                        schema
+                    } else {
                         let iter = ValueIter::new(&mut reader, None);
                         infer_json_schema_from_iterator(
                             iter.take_while(|_| take_while()),
                         )?
                     }
🧹 Nitpick comments (5)
datafusion-examples/examples/custom_data_source/csv_json_opener.rs (1)

123-129: Consider adding a comment to clarify the boolean parameter.

In example code, a bare false doesn't convey its purpose to readers. Since this PR introduces the format_array flag, adding a brief comment would help users understand the distinction between NDJSON and array-format JSON.

📝 Suggested improvement
     let opener = JsonOpener::new(
         8192,
         projected,
         FileCompressionType::UNCOMPRESSED,
         Arc::new(object_store),
-        false,
+        false, // format_array: false for NDJSON (newline-delimited), true for JSON array format
     );
datafusion/datasource-json/src/source.rs (4)

266-282: Consider optimizing memory usage for the streaming path.

The streaming path accumulates all bytes into a Vec<u8>, then the decompressed reader is passed to read_json_array_to_batches, which calls read_to_string creating another copy. This results in multiple in-memory copies of the data.

For large files, consider passing the bytes directly to avoid the extra copy, or using a streaming JSON parser that can handle arrays incrementally.


308-308: Remove redundant import.

ReaderBuilder is already imported at line 40 via use arrow::json::ReaderBuilder;. This inner import is unnecessary.

Proposed fix
 fn read_json_array_to_batches<R: Read>(
     mut reader: R,
     schema: SchemaRef,
     batch_size: usize,
 ) -> Result<Vec<RecordBatch>> {
-    use arrow::json::ReaderBuilder;
-
     let mut content = String::new();

310-333: Consider adding a clearer error message for non-array JSON input.

If a user accidentally provides a JSON object (e.g., {"key": "value"}) instead of a JSON array, the current code produces a generic serde deserialization error. Consider wrapping with a more descriptive error:

Proposed improvement
     // Parse JSON array
-    let values: Vec<serde_json::Value> = serde_json::from_str(&content)
-        .map_err(|e| DataFusionError::External(Box::new(e)))?;
+    let values: Vec<serde_json::Value> = serde_json::from_str(&content).map_err(|e| {
+        DataFusionError::External(format!(
+            "Failed to parse JSON array (ensure input is a JSON array, not an object): {e}"
+        ).into())
+    })?;

321-326: Memory optimization opportunity: avoid double serialization.

The current implementation parses the entire JSON array into Vec<serde_json::Value>, then serializes each value back to strings to create NDJSON. This involves:

  1. content String (file content)
  2. Vec<serde_json::Value> (parsed representation)
  3. ndjson String (re-serialized)

For large files, this can use ~3x the file size in peak memory.

A more efficient approach would be to write directly to a buffer while iterating, or consider if Arrow's RawDecoder can handle the values directly without the NDJSON round-trip.

Slightly improved version using a single buffer
-    // Convert to NDJSON string for arrow-json reader
-    let ndjson: String = values
-        .iter()
-        .map(|v| v.to_string())
-        .collect::<Vec<_>>()
-        .join("\n");
+    // Convert to NDJSON format for arrow-json reader
+    let mut ndjson = String::new();
+    for (i, v) in values.iter().enumerate() {
+        if i > 0 {
+            ndjson.push('\n');
+        }
+        // Write directly to ndjson to avoid intermediate Vec<String>
+        use std::fmt::Write;
+        write!(&mut ndjson, "{}", v).expect("String write cannot fail");
+    }


@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


infer_json_schema_from_iterator(iter.take_while(|_| take_while()))?

if is_array_format {
infer_json_schema_from_json_array(&mut reader, records_to_read)?

Schema inference ignores record limit across files

Medium Severity

When inferring schema from multiple JSON array files with schema_infer_max_rec configured, the code doesn't properly track the remaining records to read across files. Unlike the NDJSON path which decrements records_to_read via the take_while closure, the array format path passes records_to_read to infer_json_schema_from_json_array but never updates it afterward. This causes all files to be processed instead of stopping after reading the specified maximum number of records.



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for reading JSON files in array format, in addition to the existing newline-delimited JSON (NDJSON) format. The changes include adding a format_array option to JsonOptions and NdJsonReadOptions, updating the schema inference logic, and modifying the data reading path to handle JSON arrays. New test cases have been added to cover the new functionality, including empty arrays, schema inference limits, data reading, projection, and compression. The protobuf definitions and their serialization/deserialization logic have also been updated to reflect these changes.

Comment on lines +209 to +211
reader.read_to_string(&mut content).map_err(|e| {
ArrowError::JsonError(format!("Failed to read JSON content: {e}"))
})?;


High severity

The infer_json_schema_from_json_array function reads the entire file content into a String here. For very large JSON array files, this could lead to Out-Of-Memory (OOM) issues, especially during schema inference where only a subset of records might be needed. Consider streaming the data or using a more memory-efficient parsing approach if possible, particularly for schema inference where max_records is intended to limit the data scanned.

Comment on lines +267 to +274
// For streaming, we need to collect all bytes first
let bytes = s
.map_err(DataFusionError::from)
.try_fold(Vec::new(), |mut acc, chunk| async move {
acc.extend_from_slice(&chunk);
Ok(acc)
})
.await?;


High severity

For streaming payloads, the entire stream is collected into a Vec<u8> before processing. This can lead to Out-Of-Memory (OOM) errors for large JSON array files, similar to the schema inference logic. While format_array is a new feature, handling large files efficiently is crucial for a data processing framework. Consider if there's a way to process the stream incrementally without loading the entire file into memory.

Comment on lines +310 to +311
let mut content = String::new();
reader.read_to_string(&mut content)?;


Medium severity

The read_json_array_to_batches function reads the entire file content into a String and then parses it with serde_json::from_str. This approach, while functional, can be memory-intensive for large files. Additionally, converting the parsed serde_json::Values back into an NDJSON string and then using arrow::json::ReaderBuilder to build record batches introduces an unnecessary intermediate serialization/deserialization step. It would be more efficient to directly convert serde_json::Value into Arrow arrays or to use a parser that can directly produce Arrow arrays from a JSON array structure without intermediate string conversion.

@martin-augment
Owner Author

Superseded by #206
