7773: io: always cleanup AsyncFd registration list on deregister #47

Open
martin-augment wants to merge 34 commits into master from pr-7773-2026-01-05-07-16-58

Conversation

@martin-augment
Owner

7773: To review by AI

F4RAN and others added 30 commits December 12, 2025 22:28
Fixes memory leak when fd is closed before AsyncFd drop.

Fixes: tokio-rs#7563
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
Instead of checking absolute RSS growth (which varies with allocator
behavior), this test now runs multiple phases and checks if memory
stabilizes. A real leak causes unbounded growth across all phases;
fixed code stabilizes as memory is reused.

This approach is more robust across different CI environments where
allocator behavior may differ.
Co-authored-by: Alice Ryhl <aliceryhl@google.com>
@coderabbitai

coderabbitai bot commented Jan 5, 2026

Walkthrough

This PR modifies the IO driver's deregistration logic and introduces a memory leak regression test. In the IO driver, the deregister_source function changes its error handling pattern to always perform cleanup operations (removing registration, updating metrics, unparking) regardless of OS deregistration success, then return any error. A new test file adds a regression test using a custom global allocator to detect memory leaks when AsyncFd drops after its file descriptor is already closed, measuring heap allocation growth across multiple phases with a 64KB threshold.

@claude

claude bot commented Jan 5, 2026

Pull Request Review

Summary

This PR fixes a critical memory leak in AsyncFd (issue tokio-rs#7563) by ensuring that the internal registration list cleanup always occurs during deregistration, even when the OS-level deregister operation fails.

Code Quality & Best Practices ✅

driver.rs changes (tokio/src/runtime/io/driver.rs:298-311):
The fix is well-implemented and follows the principle that cleanup of internal state must not depend on the success of an external (OS-level) operation:

  • Separation of concerns: The change correctly separates OS-level cleanup from internal state cleanup
  • Clear intent: The comment "Cleanup ALWAYS happens" clearly documents the critical behavior
  • Proper error handling: The fix maintains the original error semantics by returning the OS result after cleanup

The change from:

self.registry.deregister(source)?;  // Early return on error - BAD!
// cleanup code here

To:

let os_result = self.registry.deregister(source);  // Continue even on error
// cleanup code here
os_result  // Return error after cleanup

This is the correct pattern for cleanup operations and follows established best practices in systems programming.
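
The before/after shapes above can be sketched in a self-contained form. This is a minimal illustration, not Tokio's actual driver code: `Driver`, `os_deregister`, the `fd_open` flag, and the `Vec<u64>` registration list are hypothetical stand-ins, and only the control flow mirrors the change described in this review.

```rust
use std::io;

/// Hypothetical stand-in for the I/O driver handle (not Tokio's real type).
struct Driver {
    /// Tokens of sources the driver still tracks internally.
    registrations: Vec<u64>,
    /// Simplified FD-count metric.
    fd_count: usize,
}

impl Driver {
    fn register(&mut self, token: u64) {
        self.registrations.push(token);
        self.fd_count += 1;
    }

    /// Simulates the OS-level deregister call, which fails when the
    /// file descriptor has already been closed.
    fn os_deregister(fd_open: bool) -> io::Result<()> {
        if fd_open {
            Ok(())
        } else {
            Err(io::Error::new(io::ErrorKind::Other, "bad file descriptor"))
        }
    }

    /// Internal cleanup: drop the registration and update the metric.
    fn cleanup(&mut self, token: u64) {
        self.registrations.retain(|t| *t != token);
        self.fd_count = self.fd_count.saturating_sub(1);
    }

    /// Old shape: `?` returns early, so cleanup is skipped on error.
    fn deregister_early_return(&mut self, token: u64, fd_open: bool) -> io::Result<()> {
        Self::os_deregister(fd_open)?;
        self.cleanup(token);
        Ok(())
    }

    /// New shape: cleanup always runs, then the OS result is returned.
    fn deregister_always_cleanup(&mut self, token: u64, fd_open: bool) -> io::Result<()> {
        let os_result = Self::os_deregister(fd_open);
        self.cleanup(token);
        os_result
    }
}
```

With `fd_open` set to `false`, both methods return an error, but only `deregister_always_cleanup` leaves `registrations` empty and `fd_count` at zero.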

Bug Analysis ✅

Root Cause: When a file descriptor was closed before AsyncFd::drop() was called, the OS-level deregister would fail (EBADF), causing an early return via the ? operator. This prevented:

  1. Removal from the internal registration list (self.registrations.deregister())
  2. Decrementing the FD count metric (self.metrics.dec_fd_count())

Impact: Each leaked registration is ~256 bytes of ScheduledIo metadata, causing unbounded memory growth in applications that repeatedly close FDs before dropping AsyncFd instances.

Fix correctness: The fix ensures cleanup happens regardless of OS errors, which is correct because:

  • Internal state should be cleaned up even if the OS resource is already gone
  • The registration list holds Arc references that need to be released
  • Metrics need to stay accurate
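
Why a retained `Arc` keeps memory alive can be shown with standard-library types alone. In this sketch, `Entry` is a hypothetical placeholder for the per-source metadata (it is not Tokio's `ScheduledIo`), and the `Vec<Arc<Entry>>` stands in for the registration list.

```rust
use std::sync::{Arc, Weak};

/// Hypothetical placeholder for per-source metadata.
struct Entry {
    _payload: [u8; 256],
}

/// Adds a new entry to the list and returns the owner's handle plus a
/// weak observer that can tell whether the entry is still alive.
fn register(list: &mut Vec<Arc<Entry>>) -> (Arc<Entry>, Weak<Entry>) {
    let entry = Arc::new(Entry { _payload: [0; 256] });
    list.push(Arc::clone(&entry));
    let weak = Arc::downgrade(&entry);
    (entry, weak)
}

/// Removes the list's reference to `entry`, if present.
fn deregister(list: &mut Vec<Arc<Entry>>, entry: &Arc<Entry>) {
    list.retain(|e| !Arc::ptr_eq(e, entry));
}
```

If the owner drops its handle without `deregister` having run, the list's clone keeps the strong count at one, so the entry is never freed. That is the shape of the leak described above.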

Test Coverage ✅

The test file io_async_fd_memory_leak.rs is excellent:

Strengths:

  1. Custom allocator tracking: Uses a global allocator wrapper to track actual allocated bytes, avoiding RSS measurement issues
  2. Three-phase approach: Warmup phase, then two measurement phases to detect ongoing growth vs stabilization
  3. Realistic reproduction: Accurately reproduces the bug scenario (closing FD before AsyncFd drop)
  4. Platform-specific: Properly gated with #[cfg(all(unix, target_os = "linux"))]
  5. Well-documented: Clear comments explain the purpose and expected behavior

Test methodology:

  • Runs 1000 iterations per phase (3000 total)
  • Compares growth between phases to detect leaks
  • Allows 64KB tolerance for allocator variance
  • Fails if memory keeps growing between phases
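
One possible reading of this methodology is sketched below. The helper names (`measure_phases`, `looks_like_leak`) and the generic `state` parameter are assumptions made for illustration; the actual test in the PR may structure its phases differently.

```rust
/// Tolerance for allocator variance, as stated in the review (64 KB).
const THRESHOLD_BYTES: usize = 64 * 1024;

/// Runs `workload` for `phases` phases of `iterations` each and records how
/// many bytes the `outstanding` probe grew during each phase.
fn measure_phases<S>(
    state: &mut S,
    phases: usize,
    iterations: usize,
    workload: impl Fn(&mut S),
    outstanding: impl Fn(&S) -> usize,
) -> Vec<usize> {
    let mut growth = Vec::with_capacity(phases);
    for _ in 0..phases {
        let before = outstanding(state);
        for _ in 0..iterations {
            workload(state);
        }
        let after = outstanding(state);
        // Shrinking memory counts as zero growth.
        growth.push(after.saturating_sub(before));
    }
    growth
}

/// Flags a leak only if every measured phase grew past the threshold.
/// A single stable phase is taken as evidence that memory is being reused.
fn looks_like_leak(measured_growth: &[usize]) -> bool {
    !measured_growth.is_empty() && measured_growth.iter().all(|g| *g > THRESHOLD_BYTES)
}
```

A caller would typically discard the first (warmup) phase and pass the remaining phases to `looks_like_leak`.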

Minor suggestions:

  1. Test isolation: This is the only test in tokio using #[global_allocator], which is good (avoids conflicts), but consider adding a comment noting this limitation
  2. Const documentation: Consider adding a comment explaining why 64KB threshold was chosen (roughly 250 leaked ScheduledIo structs)

Example:

// Allow 64KB tolerance for normal allocation variance.
// A real leak would add ~250KB per phase (1000 * ~256 bytes per ScheduledIo)
let threshold = 64 * 1024;

Performance Considerations ✅

No negative impact expected: the change adds no work to the success path.

  • Before: An OS-level error returned early and skipped the internal cleanup
  • After: Cleanup runs on a single code path regardless of the OS result

Metrics accuracy: The fix ensures dec_fd_count() is always called, keeping metrics accurate for monitoring and debugging.

Security Concerns ✅

Resource exhaustion: This fix resolves a security concern. The memory leak could be exploited for DoS attacks:

  1. An attacker could trigger the leak pattern repeatedly
  2. Unbounded memory growth leads to OOM
  3. The fix prevents this attack vector

No new vulnerabilities introduced: The change is minimal and only affects cleanup ordering.

Additional Observations

Good practices demonstrated:

  1. Minimal change: The fix touches only the necessary lines
  2. Preserves behavior: Error semantics remain unchanged for callers
  3. Comments: Added helpful inline comments
  4. Issue reference: The title references tokio-rs/tokio#7773 (note: the description mentions tokio-rs/tokio#7563; ensure consistency)

Questions:

  1. Issue number mismatch: The PR title mentions tokio-rs/tokio#7773, but the description mentions tokio-rs/tokio#7563. Should the title reference tokio-rs/tokio#7563 instead?
  2. Backport consideration: Given this is a memory leak fix, should this be backported to stable branches?

Recommendation

APPROVE

This is a well-crafted fix for a serious bug:

  • The code change is correct and minimal
  • The test coverage is excellent and comprehensive
  • No security or performance concerns
  • Follows Tokio's patterns and conventions

The only minor suggestion is to clarify the issue number reference and consider adding a brief comment in the test about the threshold calculation.

Great work on both the fix and the comprehensive test!

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
tokio/tests/io_async_fd_memory_leak.rs (1)

76-81: Add error checking for fcntl calls.

The function ignores potential fcntl failures. If F_GETFL fails, flags will be -1, and the subsequent F_SETFL will set garbage flags. Consider adding error checks or using the nix-based approach from io_async_fd.rs which includes proper error handling.

🔎 Proposed fix with error checking
 fn set_nonblocking(fd: RawFd) {
     unsafe {
         let flags = libc::fcntl(fd, libc::F_GETFL);
+        assert!(flags != -1, "fcntl(F_GETFL) failed");
-        libc::fcntl(fd, libc::F_SETFL, flags | libc::O_NONBLOCK);
+        let ret = libc::fcntl(fd, libc::F_SETFL, flags | libc::O_NONBLOCK);
+        assert!(ret != -1, "fcntl(F_SETFL) failed");
     }
 }
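
As an alternative to asserting, the helper could report failures through `io::Result`. The following is a sketch under stated assumptions, not the code from the PR: `fcntl` is declared by hand to avoid an external crate (the PR's test uses the `libc` crate), and the numeric constants are assumed values for Linux on x86_64/aarch64. In real code, prefer `libc::F_GETFL`, `libc::F_SETFL`, and `libc::O_NONBLOCK`.

```rust
use std::io;
use std::os::raw::c_int;
use std::os::unix::io::RawFd;

// Hand-written declaration so the sketch has no crate dependencies.
unsafe extern "C" {
    fn fcntl(fd: c_int, cmd: c_int, ...) -> c_int;
}

// Assumption: values for Linux on x86_64/aarch64. Other targets may differ.
const F_GETFL: c_int = 3;
const F_SETFL: c_int = 4;
const O_NONBLOCK: c_int = 0o4000;

/// Sets O_NONBLOCK on `fd`, returning the OS error if either call fails.
fn set_nonblocking(fd: RawFd) -> io::Result<()> {
    // SAFETY: F_GETFL takes no pointer arguments; an invalid fd is
    // reported through the -1 return value.
    let flags = unsafe { fcntl(fd, F_GETFL) };
    if flags == -1 {
        return Err(io::Error::last_os_error());
    }
    // SAFETY: F_SETFL takes an integer argument only.
    let ret = unsafe { fcntl(fd, F_SETFL, flags | O_NONBLOCK) };
    if ret == -1 {
        return Err(io::Error::last_os_error());
    }
    Ok(())
}
```

Returning the error lets the caller decide whether to panic; in a test, `set_nonblocking(fd).unwrap()` behaves much like the asserts in the proposed fix while also surfacing the underlying `errno`.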
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a80b4d3 and fdf771e.

📒 Files selected for processing (2)
  • tokio/src/runtime/io/driver.rs
  • tokio/tests/io_async_fd_memory_leak.rs
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-04T09:48:44.541Z
Learnt from: martin-augment
Repo: martin-augment/tokio PR: 11
File: tokio/src/util/usdt/stapsdt_aarch64.rs:1-3
Timestamp: 2025-11-04T09:48:44.541Z
Learning: In USDT (User Statically-Defined Tracing) probe implementations in Rust, semaphores must be declared as `static mut` primitive types (e.g., `static mut __usdt_sema_tokio_task__details: u16`), not atomic types. This is required for ABI compatibility with external tracing tools like DTrace and SystemTap, which modify these semaphores directly. The inline assembly code references the raw address of these semaphores, and using Rust atomic types would break the USDT mechanism.

Applied to files:

  • tokio/tests/io_async_fd_memory_leak.rs
🧬 Code graph analysis (1)
tokio/tests/io_async_fd_memory_leak.rs (1)
tokio/tests/io_async_fd.rs (6)
  • new (41-48)
  • as_raw_fd (65-67)
  • as_raw_fd (244-246)
  • as_raw_fd (935-937)
  • set_nonblocking (102-118)
  • socketpair (120-136)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: claude-review
🔇 Additional comments (5)
tokio/src/runtime/io/driver.rs (1)

298-311: LGTM! Proper fix for the memory leak.

The change correctly ensures that internal cleanup (removing registration from the set, updating metrics, and potentially unparking) always executes regardless of whether the OS-level deregistration succeeds. This is essential because when a file descriptor is closed before AsyncFd is dropped, registry.deregister() fails with EBADF, but the ScheduledIo registration still needs to be cleaned up to prevent leaking memory.

tokio/tests/io_async_fd_memory_leak.rs (4)

16-42: LGTM! Allocator tracking implementation is correct.

The alloc, dealloc, and realloc implementations correctly track the allocation delta. Notably, when realloc fails (returns null), not updating the counter is correct since the original allocation remains valid. Ordering::Relaxed is acceptable here since exact precision isn't required for the 64KB threshold check.
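
For readers unfamiliar with the technique, a byte-counting global allocator can be sketched as follows. This is an illustrative version and not the code under review: the names `TrackingAllocator`, `ALLOCATED`, and `allocated_bytes` are assumptions, and unlike the PR's allocator it does not override `realloc`, relying instead on the default implementation, which goes through `alloc` and `dealloc`.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Bytes currently allocated and not yet freed.
static ALLOCATED: AtomicUsize = AtomicUsize::new(0);

/// Wraps the system allocator and counts outstanding bytes.
struct TrackingAllocator;

unsafe impl GlobalAlloc for TrackingAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let ptr = unsafe { System.alloc(layout) };
        // Count only successful allocations.
        if !ptr.is_null() {
            ALLOCATED.fetch_add(layout.size(), Ordering::Relaxed);
        }
        ptr
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) };
        ALLOCATED.fetch_sub(layout.size(), Ordering::Relaxed);
    }
    // `realloc` and `alloc_zeroed` keep their default implementations,
    // which call `alloc`/`dealloc` above, so the counter stays consistent.
}

#[global_allocator]
static GLOBAL: TrackingAllocator = TrackingAllocator;

/// Snapshot of outstanding heap bytes. Relaxed ordering is enough for a
/// coarse threshold comparison.
fn allocated_bytes() -> usize {
    ALLOCATED.load(Ordering::Relaxed)
}
```

A test would call `allocated_bytes()` before and after each phase and compare the difference against the tolerance.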


100-112: LGTM! The fd lifecycle pattern correctly simulates the bug scenario.

The pattern of forget(fd_a) followed by manual libc::close() before dropping AsyncFd correctly reproduces the issue tokio-rs#7563 scenario where an external entity closes the file descriptor before AsyncFd is dropped. This is the exact condition that triggers the memory leak this test is designed to catch.


173-187: LGTM! Sound test logic for detecting memory leaks.

The assertion correctly identifies leaks by checking that at least one phase shows minimal growth (below 64KB). If memory is being properly reused, growth should stabilize. The 64KB threshold is appropriate given the ~250KB growth expected per 1000 iterations when leaking (~256 bytes per ScheduledIo).


58-74: LGTM!

The wrapper structs follow the established pattern from io_async_fd.rs and correctly implement AsRawFd delegation for use with AsyncFd.

@augmentcode

augmentcode bot commented Jan 5, 2026

🤖 Augment PR Summary

Summary: This PR fixes an internal I/O-driver cleanup path to prevent AsyncFd-related leaks when deregistration fails.

Changes:

  • Updates Handle::deregister_source to always remove the ScheduledIo from Tokio’s internal RegistrationSet and decrement FD metrics, even if the OS-level registry.deregister call returns an error.
  • Preserves the OS deregistration attempt ordering (still invoked first), but defers returning the error until after internal cleanup has run.
  • Adds a Linux-only regression test that reproduces the historical leak scenario by closing the raw FD before dropping AsyncFd, and verifies allocations stabilize using a tracking global allocator.

Technical Notes: The test avoids RSS-based checks (which can be distorted by allocator retention) by measuring outstanding allocation sizes directly via a custom #[global_allocator].

@augmentcode augmentcode bot left a comment

Review completed. No suggestions at this time.

@martin-augment
Owner Author

76-81: Add error checking for fcntl calls.

The function ignores potential fcntl failures. If F_GETFL fails, flags will be -1, and the subsequent F_SETFL will set garbage flags. Consider adding error checks or using the nix-based approach from io_async_fd.rs which includes proper error handling.

value:useful; category:bug; feedback: The CodeRabbit AI reviewer is correct! The F_GETFL return value should be checked against -1 before it is used to set the new flags. This prevents a syscall failure from being ignored and execution continuing with invalid flags.
