20001: docs: Automatically update DataFusion version in docs #208

martin-augment · 2026-01-25T20:41:01Z

20001: To review by AI

gemini-code-assist · 2026-01-25T20:41:17Z

Summary of Changes

Hello @martin-augment, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request streamlines the maintenance of DataFusion version numbers across the project's documentation. By modifying an existing Python script, it automates the process of updating version strings in critical user-facing files. This change ensures that installation instructions and code examples in the documentation consistently reflect the latest DataFusion release, thereby improving accuracy and reducing the potential for outdated information.

Highlights

Automated Documentation Version Updates: The Python script dev/update_datafusion_versions.py has been enhanced to automatically update DataFusion version numbers within various documentation files, reducing manual effort and ensuring consistency.
Improved Version String Matching: The update_docs function in the script now uses more robust regular expressions to correctly identify and update DataFusion version strings, supporting both direct string assignments (e.g., datafusion = "X.Y.Z") and inline-table version specifications (e.g., datafusion = { version = "X.Y.Z" }).
Expanded Documentation Coverage: The script's automatic version update process now includes additional key documentation files: docs/source/download.md, docs/source/user-guide/example-usage.md, and docs/source/user-guide/crate-configuration.md.
Documentation File Adjustments: The datafusion-docs entry was removed from the list of workspace versions to update in the script, as it does not correspond to a Cargo.toml file that requires version management in this context.
Manual Version Bumps in Docs: The pull request also includes direct updates to the DataFusion version in the aforementioned documentation files, setting them to 52.0.0 to reflect the current state.
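
As a rough illustration of the two version formats mentioned in the highlights, the sketch below applies the PR's two substitutions to an in-memory string rather than a file. The helper name bump_version and the sample snippet are hypothetical; only the two patterns come from the diff.

```python
import re


def bump_version(content: str, new_version: str) -> str:
    """Apply the two substitutions from the PR to a string (no file I/O)."""
    # Plain form: datafusion = "X.Y.Z"
    content = re.sub(
        r'datafusion = "(.+?)"',
        f'datafusion = "{new_version}"',
        content,
    )
    # Inline-table form: datafusion = { version = "X.Y.Z", ... }
    content = re.sub(
        r'datafusion = { version = "(.+?)"',
        f'datafusion = {{ version = "{new_version}"',
        content,
    )
    return content


# Hypothetical documentation snippet containing both forms.
sample = (
    'datafusion = "41.0.0"\n'
    'datafusion = { version = "31.0.0", default-features = false }\n'
)
updated = bump_version(sample, "52.0.0")
print(updated)
```

Only the quoted version is rewritten; anything after the closing quote of the inline-table form (such as default-features) is left untouched.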

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

coderabbitai · 2026-01-25T20:41:55Z

Walkthrough

This pull request updates the version management tooling and documentation references for DataFusion. The dev/update_datafusion_versions.py script is modified to remove the datafusion-docs entry from the crates mapping and enhance the regex patterns in the update_docs function to use non-greedy matching and handle nested dependency syntax. The script is also expanded to update DataFusion version references in three additional documentation files. Corresponding updates to documentation files change the example DataFusion version references to 52.0.0 across download instructions, crate configuration examples, and usage documentation.

gemini-code-assist

Code Review

This pull request automates the process of updating the DataFusion version within documentation files by enhancing a Python script. The changes introduce new regular expressions to accommodate various version formats and expand the list of documentation files to be updated. My review provides suggestions to improve the robustness of the regular expressions and to refactor the script for better maintainability.

gemini-code-assist · 2026-01-25T20:42:37Z

dev/update_datafusion_versions.py

+        content = re.sub(r'datafusion = "(.+?)"', f'datafusion = "{new_version}"', content)
+        content = re.sub(r'datafusion = { version = "(.+?)"', f'datafusion = {{ version = "{new_version}"', content)


The two re.sub calls can be combined into a single, more robust call. This new regex will handle optional whitespace and both version specification formats (datafusion = "..." and datafusion = { version = "..." }). Using a lambda function for the replacement makes the code clearer and avoids potential escaping issues. This simplifies the code and makes it more resilient to formatting variations.

Suggested change

-        content = re.sub(r'datafusion = "(.+?)"', f'datafusion = "{new_version}"', content)
-        content = re.sub(r'datafusion = { version = "(.+?)"', f'datafusion = {{ version = "{new_version}"', content)
+        content = re.sub(r'(datafusion\s*=\s*(?:{\s*version\s*=\s*)?)"(.+?)"', lambda m: f'{m.group(1)}"{new_version}"', content)

value:good-but-wont-fix; category:bug; feedback:The Gemini AI reviewer is correct that the same result can be achieved with a single regex pattern, but it would be harder to read and maintain. The regex is used in a script that is executed once per release, so optimizing it is not important.
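
For comparison, here is a small sketch of how the suggested single pattern behaves. The regex is the one from the suggestion above; the wrapper function bump_version_combined and the sample lines (with deliberately irregular whitespace) are assumptions made for illustration.

```python
import re

# Combined pattern from the review suggestion: group 1 captures everything up
# to the opening quote, including an optional "{ version =" prefix.
COMBINED = re.compile(r'(datafusion\s*=\s*(?:{\s*version\s*=\s*)?)"(.+?)"')


def bump_version_combined(content: str, new_version: str) -> str:
    # A function replacement keeps the matched prefix verbatim, so the
    # original spacing is preserved and no escaping of the prefix is needed.
    return COMBINED.sub(lambda m: f'{m.group(1)}"{new_version}"', content)


# Hypothetical inputs with whitespace variations the two-pattern form would miss.
plain = bump_version_combined('datafusion="41.0.0"', "52.0.0")
table = bump_version_combined('datafusion = {version = "31.0.0" }', "52.0.0")
print(plain)
print(table)
```

The trade-off noted in the feedback still applies: the single pattern tolerates more formatting variations but is harder to read at a glance.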

gemini-code-assist · 2026-01-25T20:42:37Z

dev/update_datafusion_versions.py

    update_docs("README.md", new_version)
+    update_docs("docs/source/download.md", new_version)
+    update_docs("docs/source/user-guide/example-usage.md", new_version)
+    update_docs("docs/source/user-guide/crate-configuration.md", new_version)


The repeated calls to update_docs can be refactored by putting the file paths into a list and iterating over it. This improves readability and makes it easier to add or remove documentation files in the future.

Suggested change

-    update_docs("README.md", new_version)
-    update_docs("docs/source/download.md", new_version)
-    update_docs("docs/source/user-guide/example-usage.md", new_version)
-    update_docs("docs/source/user-guide/crate-configuration.md", new_version)
+    doc_files_to_update = [
+        "README.md",
+        "docs/source/download.md",
+        "docs/source/user-guide/example-usage.md",
+        "docs/source/user-guide/crate-configuration.md",
+    ]
+    for doc_file in doc_files_to_update:
+        update_docs(doc_file, new_version)

value:good-to-have; category:bug; feedback:The Gemini AI reviewer is correct! Putting the file names in a list and iterating over them to update each one would be easier to read and maintain.

claude · 2026-01-25T20:43:42Z

Code Review for PR #208: Automatically Update DataFusion Version in Docs

Summary

This PR enhances the update_datafusion_versions.py script to automatically update version numbers in documentation files. The changes are focused and appropriate for their purpose.

Code Quality and Best Practices

Strengths:

Clean, focused changes that follow the existing code style
Removal of 'datafusion-docs': 'docs/Cargo.toml' from the crates dictionary is appropriate (line 51)
The regex improvements make the pattern matching more precise with non-greedy quantifiers

Critical Bug - File truncation missing:
The update_docs() function at lines 115-120 doesn't truncate the file after seeking. This could leave old content at the end if the new content is shorter. Should add fd.truncate() after fd.write(content).

Potential Bugs or Issues

File truncation missing - The file may not be properly truncated when new content is shorter than old content
Incomplete replacement - The second regex pattern at line 119 is missing proper handling of the closing brace
No error handling - The function doesn't validate whether files exist or if replacements were made
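
One possible way to address the "no validation that replacements were made" point is re.subn, which returns the number of substitutions alongside the new string. This is only a sketch: bump_version_checked is a hypothetical helper, not part of the script.

```python
import re
from typing import Tuple


def bump_version_checked(content: str, new_version: str) -> Tuple[str, int]:
    """Return the updated text and the total number of replacements made."""
    content, n_plain = re.subn(
        r'datafusion = "(.+?)"',
        f'datafusion = "{new_version}"',
        content,
    )
    content, n_table = re.subn(
        r'datafusion = { version = "(.+?)"',
        f'datafusion = {{ version = "{new_version}"',
        content,
    )
    return content, n_plain + n_table


# A file with no version string yields a count of zero, which the caller
# could turn into a warning or a non-zero exit code.
text, count = bump_version_checked("no version here\n", "52.0.0")
if count == 0:
    print("warning: no DataFusion version string found")
```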

Performance Considerations

The changes have minimal performance impact. Regex operations are simple and efficient, and adding a few more file updates won't noticeably affect script runtime.

Security Concerns

Low risk overall, but consider:

The script doesn't validate the new_version parameter format
No path validation when writing to documentation files
Could add basic input validation for version string format
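
A minimal sketch of such input validation, assuming versions are always plain MAJOR.MINOR.PATCH numbers; pre-release or build suffixes would be rejected by this pattern, which may or may not be desirable. The names VERSION_RE and is_valid_version are hypothetical.

```python
import re

# Assumption: release versions look like "52.0.0" (three dot-separated numbers).
VERSION_RE = re.compile(r"[0-9]+\.[0-9]+\.[0-9]+")


def is_valid_version(version: str) -> bool:
    # fullmatch rejects leading or trailing text, e.g. stray quotes or spaces.
    return VERSION_RE.fullmatch(version) is not None


print(is_valid_version("52.0.0"))
print(is_valid_version("52.0"))
```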

Test Coverage

Major concern: No automated tests

The script has no unit tests or integration tests
No validation that the regex patterns correctly match all expected formats
The PR updated actual documentation files but there's no way to verify the script would produce identical results

Recommendations:

Add unit tests for the update_docs() function
Add integration tests that verify the script updates all expected files
Consider adding a dry-run mode for verification
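
A dry-run mode could be sketched along the following lines: compute the updated text in memory and print a unified diff instead of writing the file. The function preview_update and its path parameter are assumptions for illustration; the script itself has no such option.

```python
import difflib
import re


def preview_update(content: str, new_version: str, path: str = "<doc>") -> str:
    """Return a unified diff of what the version update would change."""
    updated = re.sub(
        r'datafusion = "(.+?)"', f'datafusion = "{new_version}"', content
    )
    updated = re.sub(
        r'datafusion = { version = "(.+?)"',
        f'datafusion = {{ version = "{new_version}"',
        updated,
    )
    diff = difflib.unified_diff(
        content.splitlines(keepends=True),
        updated.splitlines(keepends=True),
        fromfile=path,
        tofile=path + " (updated)",
    )
    # An empty string means the update would be a no-op for this content.
    return "".join(diff)


diff_text = preview_update('datafusion = "41.0.0"\n', "52.0.0", "README.md")
print(diff_text)
```

Because the function is pure (string in, string out), it is also straightforward to cover with unit tests, which touches on the first recommendation as well.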

Additional Observations

Inconsistent version numbers: The docs had different versions before this update (41.0.0, 31.0.0, latest_version), suggesting the script hasn't been run consistently
Consider adding validation to verify all documentation files contain the expected version after updates

Recommendation

Approve with modifications - The core changes are sound and improve version update automation. However, please address the file truncation bug before merging.

Priority Fixes

High: Add fd.truncate() after write in update_docs()
Medium: Add input validation for version format
Medium: Add unit tests for the update functions
Low: Improve regex patterns to handle whitespace variations

Co-Authored-By: Claude Sonnet 4.5

coderabbitai

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

dev/update_datafusion_versions.py (1)
113-120: Missing fd.truncate() after write could leave stale content.

When using r+ mode with seek(0) and write(), if the new content is shorter than the original, leftover bytes from the old content will remain at the end of the file. While unlikely with version string replacements (versions typically grow), it's safer to truncate.
Proposed fix
 def update_docs(path: str, new_version: str):
     print(f"updating docs in {path}")
     with open(path, 'r+') as fd:
         content = fd.read()
         fd.seek(0)
         content = re.sub(r'datafusion = "(.+?)"', f'datafusion = "{new_version}"', content)
         content = re.sub(r'datafusion = { version = "(.+?)"', f'datafusion = {{ version = "{new_version}"', content)
         fd.write(content)
+        fd.truncate()

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

cursor · 2026-01-25T20:46:43Z

dev/update_datafusion_versions.py

        fd.seek(0)
-        content = re.sub(r'datafusion = "(.+)"', f'datafusion = "{new_version}"', content)
+        content = re.sub(r'datafusion = "(.+?)"', f'datafusion = "{new_version}"', content)
+        content = re.sub(r'datafusion = { version = "(.+?)"', f'datafusion = {{ version = "{new_version}"', content)


Missing file truncation causes file corruption

High Severity

The update_docs function opens files with 'r+' mode, reads content, seeks to the beginning, and writes modified content without calling fd.truncate(). When the replacement makes content shorter (e.g., "latest_version" to "52.0.0" in example-usage.md loses 8 characters), leftover bytes from the original file remain at the end, corrupting the file. This PR adds a call to process example-usage.md which triggers this corruption.

Additional Locations (1)

dev/update_datafusion_versions.py#L147-L148

value:useful; category:bug; feedback:The Bugbot AI reviewer is correct! The file should be truncated before or after writing the new content. Otherwise it may leave some extra characters at the end from the old content.
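
The truncation problem can be reproduced on a temporary file. The sketch below mirrors the read, seek(0), write sequence described above; the helper names rewrite and demo are hypothetical, and the file contents are a reduced version of the "latest_version" example from the Bugbot comment.

```python
import os
import tempfile


def rewrite(path: str, new_content: str, truncate: bool) -> None:
    # Same access pattern as described for update_docs: open with 'r+',
    # read, seek back to the start, then write the new content.
    with open(path, "r+") as fd:
        fd.read()
        fd.seek(0)
        fd.write(new_content)
        if truncate:
            fd.truncate()  # drop anything left beyond the current position


def demo(truncate: bool) -> str:
    handle, path = tempfile.mkstemp(suffix=".md")
    os.close(handle)
    try:
        with open(path, "w") as f:
            f.write('datafusion = "latest_version"\n')
        rewrite(path, 'datafusion = "52.0.0"\n', truncate)
        with open(path) as f:
            return f.read()
    finally:
        os.remove(path)


without_truncate = demo(truncate=False)
with_truncate = demo(truncate=True)
print(repr(without_truncate))  # new content followed by a stale tail
print(repr(with_truncate))     # new content only
```

An alternative that avoids the issue altogether would be to read the file, close it, and reopen it in 'w' mode for writing, since 'w' truncates on open.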

augmentcode · 2026-01-25T20:47:10Z

🤖 Augment PR Summary

Summary: Extends the version update script to also refresh DataFusion version snippets in the docs.
Changes: Updates the docs replacement regexes (including inline-table form) and applies them to additional documentation pages, updating examples to 52.0.0.

augmentcode

Review completed. 2 suggestions posted.

augmentcode · 2026-01-25T20:47:11Z

dev/update_datafusion_versions.py

-        content = re.sub(r'datafusion = "(.+)"', f'datafusion = "{new_version}"', content)
+        content = re.sub(r'datafusion = "(.+?)"', f'datafusion = "{new_version}"', content)
+        content = re.sub(r'datafusion = { version = "(.+?)"', f'datafusion = {{ version = "{new_version}"', content)
        fd.write(content)


update_docs opens files with r+, seeks to 0, and writes the new content but never truncates; if the replacement ever makes the file shorter, stale trailing bytes will remain at the end of the file.

value:useful; category:bug; feedback:The Augment AI reviewer is correct! The file should be truncated before or after writing the new content. Otherwise it may leave some extra characters at the end from the old content.

augmentcode · 2026-01-25T20:47:11Z

dev/update_datafusion_versions.py

+        content = re.sub(r'datafusion = "(.+?)"', f'datafusion = "{new_version}"', content)
+        content = re.sub(r'datafusion = { version = "(.+?)"', f'datafusion = {{ version = "{new_version}"', content)
        fd.write(content)



The regex pattern r'datafusion = { version = ...' contains an unescaped {, which is a special regex character in Python and can raise re.error at runtime (or match unexpectedly).
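
This claim is easy to check empirically. To the best of my understanding, Python's re module treats a { that does not begin a valid {m,n} quantifier as a literal character, so the pattern from the diff should compile and match as intended. The sample line below is hypothetical.

```python
import re

# Pattern exactly as written in the diff, with an unescaped '{'.
unescaped = re.compile(r'datafusion = { version = "(.+?)"')
# Equivalent pattern with the brace escaped, which makes the intent explicit.
escaped = re.compile(r'datafusion = \{ version = "(.+?)"')

line = 'datafusion = { version = "31.0.0", features = ["avro"] }'
m_unescaped = unescaped.search(line)
m_escaped = escaped.search(line)
print(m_unescaped.group(1), m_escaped.group(1))
```

Escaping the brace is harmless and arguably clearer, but under this reading it is a readability improvement rather than a fix for a runtime error.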

martin-augment · 2026-01-25T21:11:26Z

Critical Bug - File truncation missing: The update_docs() function at lines 115-120 doesn't truncate the file after seeking. This could leave old content at the end if the new content is shorter. Should add fd.truncate() after fd.write(content).

value:useful; category:bug; feedback:The Claude AI reviewer is correct! The file should be truncated before or after writing the new content. Otherwise it may leave some extra characters at the end from the old content.

martin-augment · 2026-01-25T21:12:35Z

113-120: Missing fd.truncate() after write could leave stale content.

When using r+ mode with seek(0) and write(), if the new content is shorter than the original, leftover bytes from the old content will remain at the end of the file. While unlikely with version string replacements (versions typically grow), it's safer to truncate.

value:useful; category:bug; feedback:The CodeRabbit AI reviewer is correct! The file should be truncated before or after writing the new content. Otherwise it may leave some extra characters at the end from the old content.

docs: Automatically update DataFusion version in docs

a651975
