Skip to content

20067: fix: race condition in SpillPool caused by buffered stream#219

Open
martin-augment wants to merge 2 commits intomainfrom
pr-20067-2026-01-30-12-27-30
Open

20067: fix: race condition in SpillPool caused by buffered stream#219
martin-augment wants to merge 2 commits intomainfrom
pr-20067-2026-01-30-12-27-30

Conversation

@martin-augment
Copy link
Owner

20067: To review by AI

@gemini-code-assist
Copy link

Summary of Changes

Hello @martin-augment, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical race condition within the SpillPool by refactoring how data streams are handled during spill operations. The core change involves discontinuing the use of a specific buffered stream utility, which was identified as the root cause of the concurrency issue. This ensures more reliable and stable spill management by simplifying the stream processing logic.

Highlights

  • Race Condition Fix: This pull request resolves a race condition in the SpillPool by removing the use of a buffered stream, which was identified as the underlying cause of the concurrency issue.
  • Removal of spawn_buffered: The spawn_buffered utility function is no longer used within the SpillManager::read_spill method, and its corresponding import has been removed from the file.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@coderabbitai
Copy link

coderabbitai bot commented Jan 30, 2026

Walkthrough

The change removes the buffering wrapper from the SpillManager::read_spill_as_stream method in the spill manager module. The spawn_buffered import is eliminated and the stream construction is simplified to return the raw stream directly without buffering. The import list is updated to only include metrics. This is an internal implementation change with no impact to the public API signatures.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch pr-20067-2026-01-30-12-27-30

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses a race condition in the SpillPool by removing the spawn_buffered call when reading from SpillReaderStream. The SpillReaderStream is designed to read one batch at a time without internal buffering, and the spawn_buffered utility was previously used to add an external buffering layer. Removing this external buffering directly resolves the identified race condition, aligning the behavior with the intended non-buffered nature of the SpillReaderStream for spill file reads.

@claude
Copy link

claude bot commented Jan 30, 2026

Code Review: Fix race condition in SpillPool caused by buffered stream

Summary

This PR addresses a race condition in the SpillPool by removing the use of spawn_buffered in SpillManager::read_spill_as_stream. The change is minimal but has important implications for the concurrency behavior of spill operations.


Analysis

The Change

The PR removes the buffering layer (spawn_buffered) that was previously wrapping the stream returned by read_spill_as_stream in datafusion/physical-plan/src/spill/spill_manager.rs:189.

Before:

Ok(spawn_buffered(stream, self.batch_read_buffer_capacity))

After:

Ok(stream)

Race Condition Context

Based on the codebase analysis:

  1. SpillPool Architecture: The SpillPool (in spill_pool.rs) implements a producer-consumer pattern with careful lock coordination between writers and readers.

  2. The Problem with spawn_buffered: The spawn_buffered function (in common.rs:94) spawns the stream execution to a separate tokio task with an intermediate buffer. This creates an additional layer of asynchronous buffering that can interfere with the SpillPool's lock-based coordination mechanism.

  3. SpillFile Reader: Looking at SpillFile::poll_next in spill_pool.rs:535-546, the code carefully creates reader streams inside its poll implementation while holding locks at specific points. Adding spawn_buffered on top introduces a spawned task that polls the stream independently, potentially causing:

    • Timing issues between lock acquisition and stream polling
    • Unexpected interaction with the waker registration mechanism
    • Concurrent access patterns that weren't anticipated by the SpillPool design
  4. Existing Test: The test spill_reader_stream_yield() in core/tests/execution/coop.rs:274 was specifically added to verify cooperative yielding behavior. This test actually uses spawn_buffered to simulate the old behavior, suggesting the issue is related to buffering interaction with cooperative scheduling.


Code Quality Assessment

✅ Strengths

  1. Minimal change: The fix is surgical and doesn't introduce unnecessary complexity
  2. Correct approach: Removing the buffering layer aligns with the comment in spill/mod.rs:55 which states the stream "will read one batch at a time and will not do any buffering"
  3. Clean removal: Both the usage and the import of spawn_buffered are removed

⚠️ Concerns

  1. Dead code: The field batch_read_buffer_capacity (line 45) is now unused but still exists:

    • It's set in the constructor (line 56)
    • Has a setter method with_batch_read_buffer_capacity (line 61-67)
    • Is still used in multi_level_merge.rs:276 via .with_batch_read_buffer_capacity(buffer_size)

    Recommendation: Either remove this field and its associated methods in a follow-up PR, or add a comment explaining why it's kept for future use.

  2. Documentation gap: The comment in spill/mod.rs:55 mentions using spawn_buffered for buffering, but this is no longer applicable:

    /// It will read one batch at a time and will not do any buffering, 
    /// to buffer data use [`crate::common::spawn_buffered`]
    

    Recommendation: Update or remove this comment reference to spawn_buffered in a follow-up.

  3. Missing test coverage: While existing tests exist, there doesn't appear to be a specific test that validates the race condition is fixed.

    Recommendation: Consider adding a test that specifically exercises the concurrent SpillPool scenario that was failing before this fix (if reproducible).


Performance Considerations

Potential Impact: Removing buffering could theoretically reduce throughput in some scenarios since batches are now read synchronously rather than being pre-fetched in a background task.

However:

  • The SpillReaderStream already uses cooperative() wrapper for proper yielding
  • The SpillPool has its own coordination mechanism
  • The multi_level_merge.rs:276 code suggests that callers can control buffering at a higher level if needed
  • The race condition fix is more important than potential micro-optimizations

Verdict: The performance trade-off is acceptable and likely negligible in practice.


Security Concerns

No security issues identified. The change reduces complexity which generally improves security posture.


Test Coverage

Existing tests: The codebase has comprehensive spill-related tests:

  • Multiple tests in spill/mod.rs (lines 390, 454, 485, 520, 778, 780)
  • Cooperative yielding test in coop.rs:274
  • Benchmark tests in spill_io.rs

Recommendation: Ensure all existing tests pass with this change. The test at coop.rs:274 is particularly important as it specifically validates yielding behavior with buffered streams.


Overall Assessment

Rating: ✅ Approve with minor follow-up suggestions

Summary:

  • The fix correctly addresses the race condition by removing the problematic buffering layer
  • The change is minimal and well-targeted
  • No new bugs or security issues introduced
  • Minor cleanup of dead code and documentation would be beneficial in follow-up PRs

Recommended Actions:

  1. ✅ Merge this PR as-is to fix the race condition
  2. 📝 File follow-up issue to clean up unused batch_read_buffer_capacity field and methods
  3. 📝 Update documentation in spill/mod.rs:55 to remove reference to spawn_buffered
  4. 🧪 Consider adding a regression test if the race condition is reproducible

Reviewed files:

  • datafusion/physical-plan/src/spill/spill_manager.rs

Related code examined:

  • datafusion/physical-plan/src/common.rs (spawn_buffered implementation)
  • datafusion/physical-plan/src/spill/spill_pool.rs (SpillPool reader/writer coordination)
  • datafusion/physical-plan/src/spill/mod.rs (SpillReaderStream)
  • datafusion/core/tests/execution/coop.rs (cooperative yielding tests)

@augmentcode
Copy link

augmentcode bot commented Jan 30, 2026

🤖 Augment PR Summary

Summary: Removes buffered wrapping around spill-file reads to avoid a race condition when consuming spill output.
Changes: SpillManager::read_spill_as_stream now returns the raw cooperative stream (no spawn_buffered).

🤖 Was this summary useful? React with 👍 or 👎

Copy link

@augmentcode augmentcode bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review completed. 1 suggestions posted.

Fix All in Augment

Comment augment review to trigger a new review at any time.

)));

Ok(spawn_buffered(stream, self.batch_read_buffer_capacity))
Ok(stream)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

read_spill_as_stream no longer uses self.batch_read_buffer_capacity, so with_batch_read_buffer_capacity is now a no-op (e.g. datafusion/physical-plan/src/sorts/multi_level_merge.rs:276) and the struct field/doc comment is now misleading.

Other Locations
  • datafusion/physical-plan/src/sorts/multi_level_merge.rs:276

Fix This in Augment

🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:useful; category:bug; feedback: The Augment AI reviewer is correct! The batch_read_buffer_capacity field is no more read and it could be either removed or at least marked as deprecated. Prevents maintaining dead code.

@martin-augment
Copy link
Owner Author

Dead code: The field batch_read_buffer_capacity (line 45) is now unused but still exists:

  • It's set in the constructor (line 56)
  • Has a setter method with_batch_read_buffer_capacity (line 61-67)
  • Is still used in multi_level_merge.rs:276 via .with_batch_read_buffer_capacity(buffer_size)

Recommendation: Either remove this field and its associated methods in a follow-up PR, or add a comment explaining why it's kept for future use.

value:useful; category:bug; feedback: The Claude AI reviewer is correct! The batch_read_buffer_capacity field is no more read and it could be either removed or at least marked as deprecated. Prevents maintaining dead code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants