@martin-augment
Owner

20001: To review by AI

@gemini-code-assist

Summary of Changes

Hello @martin-augment, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request streamlines the maintenance of DataFusion version numbers across the project's documentation. By modifying an existing Python script, it automates the process of updating version strings in critical user-facing files. This change ensures that installation instructions and code examples in the documentation consistently reflect the latest DataFusion release, thereby improving accuracy and reducing the potential for outdated information.

Highlights

  • Automated Documentation Version Updates: The Python script dev/update_datafusion_versions.py has been enhanced to automatically update DataFusion version numbers within various documentation files, reducing manual effort and ensuring consistency.
  • Improved Version String Matching: The update_docs function in the script now uses more robust regular expressions to correctly identify and update DataFusion version strings, supporting both direct string assignments (e.g., datafusion = "X.Y.Z") and table-like version specifications (e.g., datafusion = { version = "X.Y.Z").
  • Expanded Documentation Coverage: The script's automatic version update process now includes additional key documentation files: docs/source/download.md, docs/source/user-guide/example-usage.md, and docs/source/user-guide/crate-configuration.md.
  • Documentation File Adjustments: The datafusion-docs entry was removed from the list of workspace versions to update in the script, as it does not correspond to a Cargo.toml file that requires version management in this context.
  • Manual Version Bumps in Docs: The pull request also includes direct updates to the DataFusion version in the aforementioned documentation files, setting them to 52.0.0 to reflect the current state.

@coderabbitai

coderabbitai bot commented Jan 25, 2026

Walkthrough

This pull request updates the version management tooling and documentation references for DataFusion. The dev/update_datafusion_versions.py script is modified to remove the datafusion-docs entry from the crates mapping and enhance the regex patterns in the update_docs function to use non-greedy matching and handle nested dependency syntax. The script is also expanded to update DataFusion version references in three additional documentation files. Corresponding updates to documentation files change the example DataFusion version references to 52.0.0 across download instructions, crate configuration examples, and usage documentation.
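The switch to non-greedy matching mentioned above can be illustrated with a small sketch. The sample line is a made-up assumption (it is not taken from the DataFusion docs); only the two patterns come from the diff in this PR.

```python
import re

# Hypothetical one-line sample with two quoted values after `datafusion = `.
line = 'datafusion = "41.0.0" # or "latest"'
new_version = "52.0.0"

# Greedy: (.+) extends to the LAST double quote on the line,
# so the trailing comment is swallowed by the match.
greedy = re.sub(r'datafusion = "(.+)"', f'datafusion = "{new_version}"', line)

# Non-greedy: (.+?) stops at the FIRST closing double quote,
# so only the version string is replaced.
lazy = re.sub(r'datafusion = "(.+?)"', f'datafusion = "{new_version}"', line)

print(greedy)
print(lazy)
```

The same effect is what makes the inline-table form (`datafusion = { version = "...", features = [...] }`) safe to update without touching the quoted values that follow the version.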


@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request automates the process of updating the DataFusion version within documentation files by enhancing a Python script. The changes introduce new regular expressions to accommodate various version formats and expand the list of documentation files to be updated. My review provides suggestions to improve the robustness of the regular expressions and to refactor the script for better maintainability.

Comment on lines +118 to +119
content = re.sub(r'datafusion = "(.+?)"', f'datafusion = "{new_version}"', content)
content = re.sub(r'datafusion = { version = "(.+?)"', f'datafusion = {{ version = "{new_version}"', content)

medium

The two re.sub calls can be combined into a single, more robust call. This new regex will handle optional whitespace and both version specification formats (datafusion = "..." and datafusion = { version = "..." }). Using a lambda function for the replacement makes the code clearer and avoids potential escaping issues. This simplifies the code and makes it more resilient to formatting variations.

Suggested change
-content = re.sub(r'datafusion = "(.+?)"', f'datafusion = "{new_version}"', content)
-content = re.sub(r'datafusion = { version = "(.+?)"', f'datafusion = {{ version = "{new_version}"', content)
+content = re.sub(r'(datafusion\s*=\s*(?:{\s*version\s*=\s*)?)"(.+?)"', lambda m: f'{m.group(1)}"{new_version}"', content)
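A rough way to compare the two approaches is to run both over the same text. The sketch below assumes a made-up sample and hypothetical helper names (`two_passes`, `one_pass`); the patterns themselves are copied from the PR and from the suggestion.

```python
import re

NEW_VERSION = "52.0.0"

# Hypothetical documentation snippet covering both dependency forms.
sample = (
    'datafusion = "41.0.0"\n'
    'datafusion = { version = "31.0.0", features = ["avro"] }\n'
)

def two_passes(content: str) -> str:
    """Apply the two substitutions as written in the PR."""
    content = re.sub(r'datafusion = "(.+?)"', f'datafusion = "{NEW_VERSION}"', content)
    content = re.sub(r'datafusion = { version = "(.+?)"', f'datafusion = {{ version = "{NEW_VERSION}"', content)
    return content

def one_pass(content: str) -> str:
    """Apply the single combined pattern from the suggestion."""
    return re.sub(
        r'(datafusion\s*=\s*(?:{\s*version\s*=\s*)?)"(.+?)"',
        lambda m: f'{m.group(1)}"{NEW_VERSION}"',
        content,
    )

print(two_passes(sample) == one_pass(sample))
# The combined pattern also tolerates missing spaces around `=`.
print(one_pass('datafusion="41.0.0"'))
```

On this sample both variants produce the same text; they differ only for inputs with unusual whitespace, which the two-call version leaves unchanged.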

Owner Author

value:good-but-wont-fix; category:bug; feedback:The Gemini AI reviewer is correct that the same could be achieved with a single regex pattern, but it would be more complex to read and maintain! The regex is used in a script that is executed once per release, so it does not need to be highly optimized.

Comment on lines 146 to +149
update_docs("README.md", new_version)
update_docs("docs/source/download.md", new_version)
update_docs("docs/source/user-guide/example-usage.md", new_version)
update_docs("docs/source/user-guide/crate-configuration.md", new_version)

medium

The repeated calls to update_docs can be refactored by putting the file paths into a list and iterating over it. This improves readability and makes it easier to add or remove documentation files in the future.

Suggested change
-update_docs("README.md", new_version)
-update_docs("docs/source/download.md", new_version)
-update_docs("docs/source/user-guide/example-usage.md", new_version)
-update_docs("docs/source/user-guide/crate-configuration.md", new_version)
+doc_files_to_update = [
+    "README.md",
+    "docs/source/download.md",
+    "docs/source/user-guide/example-usage.md",
+    "docs/source/user-guide/crate-configuration.md",
+]
+for doc_file in doc_files_to_update:
+    update_docs(doc_file, new_version)

Owner Author

value:good-to-have; category:bug; feedback:The Gemini AI reviewer is correct! Using a list of file names and iterating over them in a loop to update each one would be easier to read and maintain!

@claude

claude bot commented Jan 25, 2026

Code Review for PR #208: Automatically Update DataFusion Version in Docs

Summary

This PR enhances the update_datafusion_versions.py script to automatically update version numbers in documentation files. The changes are focused and appropriate for their purpose.


Code Quality and Best Practices

Strengths:

  • Clean, focused changes that follow the existing code style
  • Removal of 'datafusion-docs': 'docs/Cargo.toml' from the crates dictionary is appropriate (line 51)
  • The regex improvements make the pattern matching more precise with non-greedy quantifiers

Critical Bug - File truncation missing:
The update_docs() function at lines 115-120 doesn't truncate the file after seeking. This could leave old content at the end if the new content is shorter. Should add fd.truncate() after fd.write(content).


Potential Bugs or Issues

  1. File truncation missing - The file may not be properly truncated when new content is shorter than old content
  2. Incomplete replacement - The second regex pattern at line 119 is missing proper handling of the closing brace
  3. No error handling - The function doesn't validate whether files exist or if replacements were made
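Point 3 could be addressed along the following lines. This is only a sketch: `replace_versions` and `update_docs_checked` are hypothetical names that do not exist in the script, and the brace is written as `\{` here purely for readability.

```python
import re
from pathlib import Path
from typing import Tuple

def replace_versions(content: str, new_version: str) -> Tuple[str, int]:
    """Return the updated text and the total number of substitutions made."""
    # re.subn behaves like re.sub but also reports how many matches were replaced.
    content, n_plain = re.subn(
        r'datafusion = "(.+?)"', f'datafusion = "{new_version}"', content
    )
    content, n_table = re.subn(
        r'datafusion = \{ version = "(.+?)"',
        f'datafusion = {{ version = "{new_version}"',
        content,
    )
    return content, n_plain + n_table

def update_docs_checked(path: str, new_version: str) -> int:
    """Hypothetical stricter variant of update_docs; not part of the PR."""
    doc = Path(path)
    if not doc.is_file():
        raise FileNotFoundError(f"documentation file not found: {path}")
    content, count = replace_versions(doc.read_text(), new_version)
    if count == 0:
        raise ValueError(f"no DataFusion version string found in {path}")
    # write_text reopens the file for writing, so no stale tail can remain.
    doc.write_text(content)
    return count
```

Failing loudly when nothing was replaced would catch a documentation file whose snippet format has drifted away from what the patterns expect.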

Performance Considerations

The changes have minimal performance impact. Regex operations are simple and efficient, and adding a few more file updates won't noticeably affect script runtime.


Security Concerns

Low risk overall, but consider:

  • The script doesn't validate the new_version parameter format
  • No path validation when writing to documentation files
  • Could add basic input validation for version string format
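A basic format check might look like the sketch below. The accepted format (MAJOR.MINOR.PATCH with an optional pre-release suffix) and the name `validate_version` are assumptions; the script may need to accept other forms.

```python
import re

# Assumed format: MAJOR.MINOR.PATCH with an optional pre-release suffix,
# e.g. "52.0.0" or "52.0.0-rc1".
VERSION_RE = re.compile(r'\d+\.\d+\.\d+(?:-[0-9A-Za-z.-]+)?')

def validate_version(new_version: str) -> str:
    """Return the version unchanged, or raise ValueError if it looks malformed."""
    if VERSION_RE.fullmatch(new_version) is None:
        raise ValueError(f"unexpected version format: {new_version!r}")
    return new_version
```

Because the version is interpolated into the replacement text, rejecting values that contain quotes or other stray characters also keeps the generated TOML snippets well-formed.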

Test Coverage

Major concern: No automated tests

  • The script has no unit tests or integration tests
  • No validation that the regex patterns correctly match all expected formats
  • The PR updated actual documentation files but there's no way to verify the script would produce identical results

Recommendations:

  1. Add unit tests for the update_docs() function
  2. Add integration tests that verify the script updates all expected files
  3. Consider adding a dry-run mode for verification
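One possible shape for a dry-run mode is to compute the updated text in memory and print a unified diff instead of writing. The function name `preview_update` is hypothetical, and the patterns are the ones from this PR with the brace escaped for clarity.

```python
import difflib
import re

def preview_update(content: str, new_version: str, name: str = "doc") -> str:
    """Return a unified diff of the version changes without writing anything."""
    updated = re.sub(
        r'datafusion = "(.+?)"', f'datafusion = "{new_version}"', content
    )
    updated = re.sub(
        r'datafusion = \{ version = "(.+?)"',
        f'datafusion = {{ version = "{new_version}"',
        updated,
    )
    diff = difflib.unified_diff(
        content.splitlines(keepends=True),
        updated.splitlines(keepends=True),
        fromfile=f"a/{name}",
        tofile=f"b/{name}",
    )
    # An empty string means the file would not change.
    return "".join(diff)

print(preview_update('[dependencies]\ndatafusion = "41.0.0"\n', "52.0.0", "download.md"))
```

Keeping the transformation as a pure string-to-string function also makes it straightforward to unit test without touching the file system.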

Additional Observations

  1. Inconsistent version numbers: The docs had different versions before this update (41.0.0, 31.0.0, latest_version), suggesting the script hasn't been run consistently
  2. Consider adding validation to verify all documentation files contain the expected version after updates

Recommendation

Approve with modifications - The core changes are sound and improve version update automation. However, please address the file truncation bug before merging.

Priority Fixes

  1. High: Add fd.truncate() after write in update_docs()
  2. Medium: Add input validation for version format
  3. Medium: Add unit tests for the update functions
  4. Low: Improve regex patterns to handle whitespace variations

Co-Authored-By: Claude Sonnet 4.5


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
dev/update_datafusion_versions.py (1)

113-120: Missing fd.truncate() after write could leave stale content.

When using r+ mode with seek(0) and write(), if the new content is shorter than the original, leftover bytes from the old content will remain at the end of the file. While unlikely with version string replacements (versions typically grow), it's safer to truncate.

Proposed fix
 def update_docs(path: str, new_version: str):
     print(f"updating docs in {path}")
     with open(path, 'r+') as fd:
         content = fd.read()
         fd.seek(0)
         content = re.sub(r'datafusion = "(.+?)"', f'datafusion = "{new_version}"', content)
         content = re.sub(r'datafusion = { version = "(.+?)"', f'datafusion = {{ version = "{new_version}"', content)
         fd.write(content)
+        fd.truncate()


@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

 fd.seek(0)
-content = re.sub(r'datafusion = "(.+)"', f'datafusion = "{new_version}"', content)
+content = re.sub(r'datafusion = "(.+?)"', f'datafusion = "{new_version}"', content)
+content = re.sub(r'datafusion = { version = "(.+?)"', f'datafusion = {{ version = "{new_version}"', content)

Missing file truncation causes file corruption

High Severity

The update_docs function opens files with 'r+' mode, reads content, seeks to the beginning, and writes modified content without calling fd.truncate(). When the replacement makes content shorter (e.g., "latest_version" to "52.0.0" in example-usage.md loses 8 characters), leftover bytes from the original file remain at the end, corrupting the file. This PR adds a call to process example-usage.md which triggers this corruption.
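The failure mode described above can be reproduced with a temporary file. This is a sketch under assumptions: the file content is invented (the real example-usage.md is longer), and `rewrite` and `demo` are hypothetical helpers that mimic the read / seek(0) / write sequence of update_docs.

```python
import os
import re
import tempfile

def rewrite(path: str, new_version: str, truncate: bool) -> str:
    """Mimic update_docs: open with r+, read, seek(0), write, optionally truncate."""
    with open(path, "r+") as fd:
        content = fd.read()
        fd.seek(0)
        content = re.sub(r'datafusion = "(.+?)"', f'datafusion = "{new_version}"', content)
        fd.write(content)
        if truncate:
            fd.truncate()  # cut the file off at the current write position
    with open(path) as fd:
        return fd.read()

def demo(truncate: bool) -> str:
    # Hypothetical file content, chosen so the replacement is shorter than the original.
    original = 'datafusion = "latest_version"\ntokio = "1.0"\n'
    handle, path = tempfile.mkstemp(suffix=".md")
    try:
        with os.fdopen(handle, "w") as f:
            f.write(original)
        return rewrite(path, "52.0.0", truncate)
    finally:
        os.remove(path)

print(repr(demo(truncate=False)))  # stale tail from the old content remains
print(repr(demo(truncate=True)))   # file ends where the new content ends
```

Replacing "latest_version" with "52.0.0" shortens the text by eight characters, so without the truncate call the last eight characters of the old file are left behind after the new content.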

Owner Author

value:useful; category:bug; feedback:The Bugbot AI reviewer is correct! The file should be truncated before or after writing the new content. Otherwise it may leave some extra characters at the end from the old content.

@augmentcode

augmentcode bot commented Jan 25, 2026

🤖 Augment PR Summary

Summary: Extends the version update script to also refresh DataFusion version snippets in the docs.
Changes: Updates the docs replacement regexes (including inline-table form) and applies them to additional documentation pages, updating examples to 52.0.0.

@augmentcode augmentcode bot left a comment

Review completed. 2 suggestions posted.

-content = re.sub(r'datafusion = "(.+)"', f'datafusion = "{new_version}"', content)
+content = re.sub(r'datafusion = "(.+?)"', f'datafusion = "{new_version}"', content)
+content = re.sub(r'datafusion = { version = "(.+?)"', f'datafusion = {{ version = "{new_version}"', content)
 fd.write(content)

update_docs opens files with r+, seeks to 0, and writes the new content but never truncates; if the replacement ever makes the file shorter, stale trailing bytes will remain at the end of the file.

Owner Author

value:useful; category:bug; feedback:The Augment AI reviewer is correct! The file should be truncated before or after writing the new content. Otherwise it may leave some extra characters at the end from the old content.

content = re.sub(r'datafusion = "(.+?)"', f'datafusion = "{new_version}"', content)
content = re.sub(r'datafusion = { version = "(.+?)"', f'datafusion = {{ version = "{new_version}"', content)
fd.write(content)

The regex pattern r'datafusion = { version = ...' contains an unescaped {, which is a special regex character in Python and can raise re.error at runtime (or match unexpectedly).
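Whether the unescaped brace is actually a problem can be checked directly. As far as I can tell, Python's re module treats a `{` that does not start a valid `{m,n}` quantifier as a literal character, so the pattern from the PR should compile and match; escaping it is still arguably clearer. The sample line below is an assumption.

```python
import re

unescaped = r'datafusion = { version = "(.+?)"'   # pattern as written in the PR
escaped = r'datafusion = \{ version = "(.+?)"'    # explicit literal brace

# Hypothetical inline-table dependency line.
line = 'datafusion = { version = "41.0.0", default-features = false }'

# Both patterns compile; the lone `{` is not followed by digits and `}`,
# so it is not parsed as a repetition count.
m1 = re.search(unescaped, line)
m2 = re.search(escaped, line)

print(m1.group(1), m2.group(1))
```

If both searches succeed with the same captured version, the escape is a readability choice rather than a correctness fix for this particular pattern.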

@martin-augment
Owner Author

Critical Bug - File truncation missing: The update_docs() function at lines 115-120 doesn't truncate the file after seeking. This could leave old content at the end if the new content is shorter. Should add fd.truncate() after fd.write(content).

value:useful; category:bug; feedback:The Claude AI reviewer is correct! The file should be truncated before or after writing the new content. Otherwise it may leave some extra characters at the end from the old content.

@martin-augment
Owner Author

113-120: Missing fd.truncate() after write could leave stale content.

When using r+ mode with seek(0) and write(), if the new content is shorter than the original, leftover bytes from the old content will remain at the end of the file. While unlikely with version string replacements (versions typically grow), it's safer to truncate.

value:useful; category:bug; feedback:The CodeRabbit AI reviewer is correct! The file should be truncated before or after writing the new content. Otherwise it may leave some extra characters at the end from the old content.
