19924: Support JSON arrays reader/parse for datafusion#205
martin-augment wants to merge 1 commit into main
Conversation
Summary of Changes

Hello @martin-augment, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly extends DataFusion's capabilities for reading JSON data by introducing support for JSON files structured as a single array of objects. Previously, DataFusion primarily handled newline-delimited JSON; with the new `format_array` option, array-structured files can now be read as well.
Walkthrough

This pull request adds support for JSON array format handling alongside the existing line-delimited NDJSON format in DataFusion.
PR Review: Support JSON Arrays Reader/Parse for DataFusion

Overview

This PR adds support for reading JSON files in array format in addition to the existing line-delimited (NDJSON) format.

Overall Assessment: Good implementation with comprehensive testing. A few concerns need to be addressed before merging.

Code Quality and Best Practices
Strengths:
Issues:

Potential Bugs and Issues

Performance Considerations

Security Concerns

Test Coverage
Strengths:
Suggestions:

Recommendation
Must Fix:
Should Fix:
Nice to Have:

Great work overall! Main concerns are memory safety and performance for large files.
🤖 Augment PR Summary

Summary: This PR adds first-class support for reading JSON files whose top-level value is an array (e.g. `[{...}, {...}]`), in addition to the existing line-delimited format.

Changes:

Technical Notes: Array mode disables range-based scanning and currently requires full-file reads to parse/convert the array into record batches.

🤖 Was this summary useful? React with 👍 or 👎
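For orientation, here is a minimal, self-contained sketch of the array-format read path as the reviews below describe it (read the whole payload, parse the top-level array, re-serialize as NDJSON for arrow-json). Names and error handling are illustrative only, not the PR's exact code:

```rust
use std::io::{Cursor, Read};
use std::sync::Arc;

use arrow::datatypes::{DataType, Field, Schema, SchemaRef};
use arrow::json::ReaderBuilder;
use arrow::record_batch::RecordBatch;

// Simplified sketch (not the PR's exact code): read the whole file, parse the
// top-level JSON array, re-serialize as NDJSON, and reuse the line-delimited reader.
fn read_json_array_sketch<R: Read>(
    mut reader: R,
    schema: SchemaRef,
    batch_size: usize,
) -> Result<Vec<RecordBatch>, Box<dyn std::error::Error>> {
    let mut content = String::new();
    reader.read_to_string(&mut content)?;

    let values: Vec<serde_json::Value> = serde_json::from_str(&content)?;

    // One JSON object per line so the existing NDJSON reader can consume it.
    let ndjson = values
        .iter()
        .map(|v| v.to_string())
        .collect::<Vec<_>>()
        .join("\n");

    let json_reader = ReaderBuilder::new(schema)
        .with_batch_size(batch_size)
        .build(Cursor::new(ndjson))?;
    Ok(json_reader.collect::<Result<Vec<_>, _>>()?)
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let data = r#"[{"a": 1}, {"a": 2}]"#;
    let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int64, true)]));
    let batches = read_json_array_sketch(data.as_bytes(), schema, 1024)?;
    assert_eq!(batches[0].num_rows(), 2);
    Ok(())
}
```

This also makes the trade-off visible: the whole file plus a parsed and a re-serialized copy are held in memory, which is what several of the review comments below push back on.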
```rust
infer_json_schema_from_iterator(iter.take_while(|_| take_while()))?
// ...
if is_array_format {
    infer_json_schema_from_json_array(&mut reader, records_to_read)?
```
In infer_schema, records_to_read is only decremented via the NDJSON take_while() path, so in JSON-array mode it never decreases and the schema_infer_max_rec limit / early break won’t apply across multiple files. This can cause schema inference to scan more records/files than requested and potentially infer additional fields unexpectedly.
🤖 Was this useful? React with 👍 or 👎
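For illustration, a self-contained sketch of the "global budget" behavior this comment asks for (hypothetical helper, not the PR's code): each file consumes part of `schema_infer_max_rec`, and once the budget is spent no further files are sampled.

```rust
use serde_json::Value;

// Hypothetical helper: sample array-format files against a single global budget.
fn sample_across_files(files: &[&str], mut records_to_read: usize) -> Vec<Value> {
    let mut sampled = Vec::new();
    for content in files {
        if records_to_read == 0 {
            break; // budget exhausted: stop instead of scanning every file
        }
        let values: Vec<Value> = serde_json::from_str(content).unwrap_or_default();
        let used = values.len().min(records_to_read);
        sampled.extend(values.into_iter().take(used));
        records_to_read -= used;
    }
    sampled
}

fn main() {
    let files = [r#"[{"a":1},{"a":2}]"#, r#"[{"b":3}]"#];
    // With a budget of 2, the second file is never sampled, so field "b"
    // would not leak into the inferred schema.
    assert_eq!(sample_across_files(&files, 2).len(), 2);
}
```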
```rust
pub file_sort_order: Vec<Vec<SortExpr>>,
/// Whether the JSON file is in array format `[{...}, {...}]` instead of
/// line-delimited format. Defaults to `false`.
pub format_array: bool,
```
Adding pub format_array to the public NdJsonReadOptions struct is a semver-breaking change for downstream users that construct it with a struct literal. Consider whether NdJsonReadOptions should be #[non_exhaustive] (or otherwise avoid adding new public fields) to preserve forward compatibility.
🤖 Was this useful? React with 👍 or 👎
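A small illustration of the trade-off mentioned above, using hypothetical types rather than DataFusion's actual definitions: with `#[non_exhaustive]`, downstream crates cannot construct the struct with a literal, so adding fields later stops being a breaking change.

```rust
// Hypothetical stand-in for an options struct; not DataFusion's real definition.
#[non_exhaustive]
#[derive(Debug, Clone, Default)]
pub struct ReadOptions {
    pub schema_infer_max_rec: usize,
    pub format_array: bool, // new fields can be added without breaking callers
}

impl ReadOptions {
    pub fn new() -> Self {
        Self::default()
    }

    // Builder-style setters keep construction ergonomic without struct literals.
    pub fn with_format_array(mut self, format_array: bool) -> Self {
        self.format_array = format_array;
        self
    }
}

fn main() {
    // Outside the defining crate, `ReadOptions { .. }` literals would not compile,
    // so adding a field later is not a semver break for these callers.
    let opts = ReadOptions::new().with_format_array(true);
    println!("{opts:?}");
}
```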
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
datafusion/datasource-json/src/file_format.rs (1)
200-304: Schema inference limit isn’t enforced across files for array format.
In the array branch, `records_to_read` never decreases, so `schema_infer_max_rec` is applied per file instead of globally. This diverges from the line-delimited behavior and can over-read schemas when multiple files are provided.

🔧 Proposed fix (track how many array elements were consumed):
```diff
-fn infer_json_schema_from_json_array<R: Read>(
-    reader: &mut R,
-    max_records: usize,
-) -> std::result::Result<Schema, ArrowError> {
+fn infer_json_schema_from_json_array<R: Read>(
+    reader: &mut R,
+    max_records: usize,
+) -> std::result::Result<(Schema, usize), ArrowError> {
     let mut content = String::new();
     reader.read_to_string(&mut content).map_err(|e| {
         ArrowError::JsonError(format!("Failed to read JSON content: {e}"))
     })?;

     // Parse as JSON array using serde_json
     let values: Vec<serde_json::Value> = serde_json::from_str(&content)
         .map_err(|e| ArrowError::JsonError(format!("Failed to parse JSON array: {e}")))?;

     // Take only max_records for schema inference
-    let values_to_infer: Vec<_> = values.into_iter().take(max_records).collect();
+    let take = values.len().min(max_records);
+    let values_to_infer: Vec<_> = values.into_iter().take(take).collect();

-    if values_to_infer.is_empty() {
+    if take == 0 {
         return Err(ArrowError::JsonError(
             "JSON array is empty, cannot infer schema".to_string(),
         ));
     }

     // Use arrow's schema inference on the parsed values
-    infer_json_schema_from_iterator(values_to_infer.into_iter().map(Ok))
+    let schema = infer_json_schema_from_iterator(values_to_infer.into_iter().map(Ok))?;
+    Ok((schema, take))
 }
@@
-                if is_array_format {
-                    infer_json_schema_from_json_array(&mut reader, records_to_read)?
-                } else {
+                if is_array_format {
+                    let (schema, used) =
+                        infer_json_schema_from_json_array(&mut reader, records_to_read)?;
+                    records_to_read = records_to_read.saturating_sub(used);
+                    schema
+                } else {
                     let iter = ValueIter::new(&mut reader, None);
                     infer_json_schema_from_iterator(
                         iter.take_while(|_| take_while()),
                     )?
                 }
@@
-                if is_array_format {
-                    infer_json_schema_from_json_array(&mut reader, records_to_read)?
-                } else {
+                if is_array_format {
+                    let (schema, used) =
+                        infer_json_schema_from_json_array(&mut reader, records_to_read)?;
+                    records_to_read = records_to_read.saturating_sub(used);
+                    schema
+                } else {
                     let iter = ValueIter::new(&mut reader, None);
                     infer_json_schema_from_iterator(
                         iter.take_while(|_| take_while()),
                     )?
                 }
```
🧹 Nitpick comments (5)
datafusion-examples/examples/custom_data_source/csv_json_opener.rs (1)
123-129: Consider adding a comment to clarify the boolean parameter.

In example code, a bare `false` doesn't convey its purpose to readers. Since this PR introduces the `format_array` flag, adding a brief comment would help users understand the distinction between NDJSON and array-format JSON.

📝 Suggested improvement

```diff
 let opener = JsonOpener::new(
     8192,
     projected,
     FileCompressionType::UNCOMPRESSED,
     Arc::new(object_store),
-    false,
+    false, // format_array: false for NDJSON (newline-delimited), true for JSON array format
 );
```

datafusion/datasource-json/src/source.rs (4)
266-282: Consider optimizing memory usage for the streaming path.

The streaming path accumulates all bytes into a `Vec<u8>`, then the decompressed reader is passed to `read_json_array_to_batches`, which calls `read_to_string`, creating another copy. This results in multiple in-memory copies of the data. For large files, consider passing the bytes directly to avoid the extra copy, or using a streaming JSON parser that can handle arrays incrementally.
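A sketch of the "pass the bytes directly" variant, assuming the collected buffer is already decompressed (the helper name is hypothetical): parsing with `serde_json::from_slice` skips the extra `String` that `read_to_string` would add on top of the `Vec<u8>`.

```rust
use serde_json::Value;

// Hypothetical helper: parse the collected (already-decompressed) bytes directly.
fn parse_array_from_bytes(bytes: &[u8]) -> Result<Vec<Value>, serde_json::Error> {
    serde_json::from_slice(bytes)
}

fn main() -> Result<(), serde_json::Error> {
    let bytes = br#"[{"a": 1}, {"a": 2}]"#;
    let values = parse_array_from_bytes(bytes)?;
    assert_eq!(values.len(), 2);
    Ok(())
}
```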
308-308: Remove redundant import.
`ReaderBuilder` is already imported at line 40 via `use arrow::json::ReaderBuilder;`. This inner import is unnecessary.

Proposed fix

```diff
 fn read_json_array_to_batches<R: Read>(
     mut reader: R,
     schema: SchemaRef,
     batch_size: usize,
 ) -> Result<Vec<RecordBatch>> {
-    use arrow::json::ReaderBuilder;
-
     let mut content = String::new();
```
310-333: Consider adding a clearer error message for non-array JSON input.

If a user accidentally provides a JSON object (e.g., `{"key": "value"}`) instead of a JSON array, the current code produces a generic serde deserialization error. Consider wrapping it with a more descriptive error:

Proposed improvement

```diff
     // Parse JSON array
-    let values: Vec<serde_json::Value> = serde_json::from_str(&content)
-        .map_err(|e| DataFusionError::External(Box::new(e)))?;
+    let values: Vec<serde_json::Value> = serde_json::from_str(&content).map_err(|e| {
+        DataFusionError::External(format!(
+            "Failed to parse JSON array (ensure input is a JSON array, not an object): {e}"
+        ).into())
+    })?;
```
321-326: Memory optimization opportunity: avoid double serialization.

The current implementation parses the entire JSON array into `Vec<serde_json::Value>`, then serializes each value back to strings to create NDJSON. This involves:

- `content` String (file content)
- `Vec<serde_json::Value>` (parsed representation)
- `ndjson` String (re-serialized)

For large files, this can use ~3x the file size in peak memory.
A more efficient approach would be to write directly to a buffer while iterating, or consider if Arrow's RawDecoder can handle the values directly without the NDJSON round-trip.
Slightly improved version using a single buffer
```diff
-    // Convert to NDJSON string for arrow-json reader
-    let ndjson: String = values
-        .iter()
-        .map(|v| v.to_string())
-        .collect::<Vec<_>>()
-        .join("\n");
+    // Convert to NDJSON format for arrow-json reader
+    let mut ndjson = String::new();
+    for (i, v) in values.iter().enumerate() {
+        if i > 0 {
+            ndjson.push('\n');
+        }
+        // Write directly to ndjson to avoid intermediate Vec<String>
+        use std::fmt::Write;
+        write!(&mut ndjson, "{}", v).expect("String write cannot fail");
+    }
```
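If avoiding the NDJSON round-trip entirely is worth it, arrow-json's `Decoder` (formerly `RawDecoder`) can be fed serde-serializable rows directly. A sketch, assuming a reasonably recent arrow version where `ReaderBuilder::build_decoder` and `Decoder::serialize` are available:

```rust
use std::sync::Arc;

use arrow::datatypes::{DataType, Field, Schema};
use arrow::error::ArrowError;
use arrow::json::ReaderBuilder;
use arrow::record_batch::RecordBatch;
use serde_json::Value;

// Sketch: feed parsed serde_json::Values straight into arrow-json's Decoder,
// so no intermediate NDJSON string is built.
fn values_to_batches(
    values: &[Value],
    schema: Arc<Schema>,
    batch_size: usize,
) -> Result<Vec<RecordBatch>, ArrowError> {
    let mut decoder = ReaderBuilder::new(schema).build_decoder()?;
    let mut batches = Vec::new();
    for chunk in values.chunks(batch_size) {
        decoder.serialize(chunk)?; // serialize rows directly into the decoder
        if let Some(batch) = decoder.flush()? {
            batches.push(batch);
        }
    }
    Ok(batches)
}

fn main() -> Result<(), ArrowError> {
    let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int64, true)]));
    let values: Vec<Value> = serde_json::from_str(r#"[{"a": 1}, {"a": 2}, {"a": 3}]"#)
        .map_err(|e| ArrowError::JsonError(e.to_string()))?;
    let batches = values_to_batches(&values, schema, 2)?;
    assert_eq!(batches.iter().map(|b| b.num_rows()).sum::<usize>(), 3);
    Ok(())
}
```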
Cursor Bugbot has reviewed your changes and found 1 potential issue.
```rust
infer_json_schema_from_iterator(iter.take_while(|_| take_while()))?
// ...
if is_array_format {
    infer_json_schema_from_json_array(&mut reader, records_to_read)?
```
Schema inference ignores record limit across files
Medium Severity
When inferring schema from multiple JSON array files with schema_infer_max_rec configured, the code doesn't properly track the remaining records to read across files. Unlike the NDJSON path which decrements records_to_read via the take_while closure, the array format path passes records_to_read to infer_json_schema_from_json_array but never updates it afterward. This causes all files to be processed instead of stopping after reading the specified maximum number of records.
Code Review
This pull request introduces support for reading JSON files in array format, in addition to the existing newline-delimited JSON (NDJSON) format. The changes include adding a format_array option to JsonOptions and NdJsonReadOptions, updating the schema inference logic, and modifying the data reading path to handle JSON arrays. New test cases have been added to cover the new functionality, including empty arrays, schema inference limits, data reading, projection, and compression. The protobuf definitions and their serialization/deserialization logic have also been updated to reflect these changes.
```rust
reader.read_to_string(&mut content).map_err(|e| {
    ArrowError::JsonError(format!("Failed to read JSON content: {e}"))
})?;
```
The infer_json_schema_from_json_array function reads the entire file content into a String here. For very large JSON array files, this could lead to Out-Of-Memory (OOM) issues, especially during schema inference where only a subset of records might be needed. Consider streaming the data or using a more memory-efficient parsing approach if possible, particularly for schema inference where max_records is intended to limit the data scanned.
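One memory-bounded shape for the inference step (a sketch, not the PR's code): deserialize at most `max_records` elements into `serde_json::Value`s and parse-but-discard the rest, so peak memory is bounded by the sample rather than by the whole array. The helper and type names below are illustrative.

```rust
use serde::de::{DeserializeSeed, Deserializer, IgnoredAny, SeqAccess, Visitor};
use serde_json::Value;
use std::fmt;
use std::io::Read;

/// Deserializes at most `max_records` elements of a top-level JSON array,
/// parsing (but discarding) the remainder so memory stays bounded by the cap.
struct BoundedArray {
    max_records: usize,
}

impl<'de> DeserializeSeed<'de> for BoundedArray {
    type Value = Vec<Value>;

    fn deserialize<D>(self, deserializer: D) -> Result<Self::Value, D::Error>
    where
        D: Deserializer<'de>,
    {
        struct BoundedVisitor(usize);

        impl<'de> Visitor<'de> for BoundedVisitor {
            type Value = Vec<Value>;

            fn expecting(&self, f: &mut fmt::Formatter) -> fmt::Result {
                write!(f, "a JSON array")
            }

            fn visit_seq<A: SeqAccess<'de>>(self, mut seq: A) -> Result<Self::Value, A::Error> {
                let mut kept = Vec::new();
                while kept.len() < self.0 {
                    match seq.next_element::<Value>()? {
                        Some(v) => kept.push(v),
                        None => return Ok(kept),
                    }
                }
                // Parse but discard the tail so the deserializer reaches the closing `]`.
                while seq.next_element::<IgnoredAny>()?.is_some() {}
                Ok(kept)
            }
        }

        deserializer.deserialize_seq(BoundedVisitor(self.max_records))
    }
}

fn sample_json_array<R: Read>(reader: R, max_records: usize) -> serde_json::Result<Vec<Value>> {
    let mut de = serde_json::Deserializer::from_reader(reader);
    BoundedArray { max_records }.deserialize(&mut de)
}

fn main() -> serde_json::Result<()> {
    let data = r#"[{"a": 1}, {"a": 2}, {"a": 3, "b": true}]"#;
    let sample = sample_json_array(data.as_bytes(), 2)?;
    assert_eq!(sample.len(), 2); // only the first two rows are materialized
    Ok(())
}
```

This still reads the whole file from the reader, but only the sampled prefix is held as parsed values, which is what matters for schema inference with `schema_infer_max_rec`.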
```rust
// For streaming, we need to collect all bytes first
let bytes = s
    .map_err(DataFusionError::from)
    .try_fold(Vec::new(), |mut acc, chunk| async move {
        acc.extend_from_slice(&chunk);
        Ok(acc)
    })
    .await?;
```
For streaming payloads, the entire stream is collected into a Vec<u8> before processing. This can lead to Out-Of-Memory (OOM) errors for large JSON array files, similar to the schema inference logic. While format_array is a new feature, handling large files efficiently is crucial for a data processing framework. Consider if there's a way to process the stream incrementally without loading the entire file into memory.
```rust
let mut content = String::new();
reader.read_to_string(&mut content)?;
```
The read_json_array_to_batches function reads the entire file content into a String and then parses it with serde_json::from_str. This approach, while functional, can be memory-intensive for large files. Additionally, converting the parsed serde_json::Values back into an NDJSON string and then using arrow::json::ReaderBuilder to build record batches introduces an unnecessary intermediate serialization/deserialization step. It would be more efficient to directly convert serde_json::Value into Arrow arrays or to use a parser that can directly produce Arrow arrays from a JSON array structure without intermediate string conversion.
Superseded by #206
19924: To review by AI