Skip to content

Conversation

@martin-augment
Copy link
Owner

19931: To review by AI

@coderabbitai
Copy link

coderabbitai bot commented Jan 22, 2026

Walkthrough

This change introduces a new public field single_file_output: Option<bool> to the FileSinkConfig struct, enabling explicit configuration of single-file versus directory output behavior. The field is propagated through write paths for CSV, JSON, and Parquet formats, parsed from COPY command options in the physical planner, and honored in the demux logic with a fallback to extension-based heuristics. Initialization sites across the codebase are updated to include this field, and new tests validate the behavior for paths without file extensions.

✨ Finishing touches
  • 📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@claude
Copy link

claude bot commented Jan 22, 2026

Code Review for PR #203

Summary

This PR fixes a regression where DataFrameWriteOptions::with_single_file_output(true) was being ignored for paths without file extensions. The fix introduces an explicit single_file_output: Option<bool> field to FileSinkConfig that overrides the existing extension-based heuristic.

Code Quality: ✅ Good

Strengths

  1. Clear separation of concerns: The PR properly distinguishes between extension heuristic (default) and explicit user intent via with_single_file_output()
  2. Consistent implementation: Pattern applied uniformly across DataFrame write methods, physical planner, and all sink configurations
  3. Good documentation: Clear inline documentation explaining the three states
  4. Backward compatibility: Using Option<bool> with None as default preserves existing behavior

Minor Issues

  1. Repetitive code in dataframe/mod.rs (lines 2051-2055, 2120-2126): The copy_options building logic is duplicated across write_csv() and write_json(). Consider extracting to a helper method.
  2. Inconsistent HashMap construction: In parquet.rs:89 uses HashMap::<String, String>::new(), while mod.rs uses HashMap::new(). Prefer simpler form for consistency.

Potential Bugs: ⚠️ Minor Concern

  1. String parsing in physical_planner.rs (lines 553-565): Parsing is case-sensitive. Consider eq_ignore_ascii_case() for "true"/"false"
  2. Missing validation: When single_file_output: Some(false) is set with partitioning disabled, behavior might be unexpected

Performance: ✅ Good

  • Minimal performance impact
  • unwrap_or_else() in demux.rs:110 only runs when needed
  • String operations only during planning, not execution

Security: ✅ No Concerns

  • No new vulnerabilities
  • Proper error handling for invalid values

Test Coverage: ✅ Excellent

The new test test_single_file_output_without_extension() directly tests the regression scenario and verifies both file system state and parquet validity.

Suggestions:

  1. Add test for with_single_file_output(false) to verify directory output
  2. Add test for paths with extensions to ensure heuristic still works
  3. Add similar tests for CSV/JSON formats

Architecture: ✅ Well Designed

The tri-state Option<bool> design elegantly handles backward compatibility while allowing explicit control.

Recommendations

High Priority

  1. Add tests for CSV and JSON to ensure parity across formats
  2. Extract duplicate code in dataframe/mod.rs

Medium Priority

  1. Add test for with_single_file_output(false) case
  2. Document edge cases

Low Priority

  1. Case-insensitive string parsing
  2. Add inline examples to docs

Overall Assessment: ✅ APPROVED

Solid fix addressing a clear regression while maintaining backward compatibility. Implementation is consistent and well-tested for the primary use case.

Score: 8.5/10

@augmentcode
Copy link

augmentcode bot commented Jan 22, 2026

🤖 Augment PR Summary

Summary: This PR ensures DataFrameWriteOptions::with_single_file_output(true) is honored even when the output path has no file extension.

Changes:

  • Plumbs single_file_output from DataFrame write APIs into the COPY TO logical plan options
  • Parses single_file_output in the physical planner and threads it through FileSinkConfig
  • Updates the demuxer to prefer an explicit single_file_output override over the extension-based heuristic
  • Extends sink/config initializations and proto deserialization defaults to preserve backward-compatible behavior
  • Adds a regression test verifying a no-extension path produces a single Parquet file

Technical Notes: The default behavior remains heuristic-based unless an explicit single_file_output option is provided.

🤖 Was this summary useful? React with 👍 or 👎

Copy link

@augmentcode augmentcode bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review completed. 1 suggestions posted.

Fix All in Augment

Comment augment review to trigger a new review at any time.

/// - `None`: use extension heuristic (path with extension = single file)
/// - `Some(true)`: force single file output at exact path
/// - `Some(false)`: force directory output with generated filenames
pub single_file_output: Option<bool>,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FileSinkConfig is a public struct, so adding the single_file_output field is a breaking API change for downstream crates that construct it with a struct literal.

Fix This in Augment

🤖 Was this useful? React with 👍 or 👎

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:valid-but-wont-fix; category:bug; feedback:The Augment AI reviewer is correct that the addition of the new field breaks the public API but this is not really an issue for Apache DataFusion because its release policy is to release major versions which allow breaking the API. Minor versions are released only when there are big issues with a freshly released major version.

@martin-augment
Copy link
Owner Author

  • String parsing in physical_planner.rs (lines 553-565): Parsing is case-sensitive. Consider eq_ignore_ascii_case() for "true"/"false"

value:good-to-have; category:bug; feedback:The Claude AI reviewer is correct that by using eq_ignore_ascii_case() the reading of the new config property value would be more user-friendly and accept upper-cased/capitalized

@martin-augment
Copy link
Owner Author

2. Missing validation: When single_file_output: Some(false) is set with partitioning disabled, behavior might be unexpected

value:incorrect-but-reasonable; category:bug; feedback:The Claude AI reviewer is not correct! "Some(false)" means that the file writer should use the provided path as a file system directory. If partitioning is disabled then this directory will contain only one file for the single partition.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants