Skip to content

Conversation

@NathanFlurry
Copy link
Member

No description provided.

@vercel
Copy link

vercel bot commented Nov 3, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Preview Comments Updated (UTC)
rivetkit-serverless Ready Ready Preview Comment Nov 4, 2025 8:26pm
3 Skipped Deployments
Project Deployment Preview Comments Updated (UTC)
rivet-cloud Ignored Ignored Nov 4, 2025 8:26pm
rivet-inspector Ignored Ignored Preview Nov 4, 2025 8:26pm
rivet-site Ignored Ignored Preview Nov 4, 2025 8:26pm

Copy link
Member Author

NathanFlurry commented Nov 3, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude
Copy link

claude bot commented Nov 3, 2025

Code Review: PR #3325 - Explicitly handle errors in _onStop

Summary

This PR adds error handling around the _onStop lifecycle hook call in the Engine actor driver to prevent errors from interrupting the actor cleanup process.

Positive Aspects

Critical bug fix: This addresses a real issue where errors thrown during _onStop could prevent proper actor cleanup, potentially leading to memory leaks or actors not being removed from the internal map.

Consistent with patterns in the codebase: The error handling approach using try-catch with stringifyError is consistent with similar patterns already used elsewhere (e.g., line 235-239 in the same file).

Good defensive programming: The error message clearly indicates the error is being logged but the cleanup will proceed, which is the correct behavior for a cleanup handler.

Matches the ActorInstance implementation: Looking at instance.ts:2004-2042, the _onStop method already has internal error handling for the onStop lifecycle hook, so this adds an additional safety layer at the driver level.

Suggestions for Improvement

1. Consider logging the actorId for better debugging

The error log should include the actorId to make debugging easier when multiple actors are running:

logger().error({
    msg: "error in _onStop, proceeding with removing actor",
    actorId,  // Add this for context
    err: stringifyError(err),
});

2. Alignment with instance.ts patterns

In instance.ts:2032-2040, the _onStop implementation distinguishes between DeadlineError and other errors. Consider whether a similar pattern would be useful here, or if the current simpler approach is sufficient for the driver level.

3. Minor: Variable naming consistency

The catch block uses err while some other parts of the codebase use error. Both are fine, but consistency within the file would be ideal. Looking at the codebase, error appears more common (line 235-238).

Testing Considerations

⚠️ Missing test coverage: There don't appear to be tests for the EngineActorDriver in the rivetkit-typescript package. Consider adding tests that:

  • Verify actors are properly removed even when _onStop throws
  • Verify the error is logged appropriately
  • Test that subsequent operations work correctly after a failed _onStop

Security & Performance

✅ No security concerns identified
✅ No performance impact - this only affects the error path
✅ Proper use of existing stringifyError utility

Conclusion

This is a solid bug fix that prevents errors in user-defined cleanup handlers from breaking the actor lifecycle management. The implementation is clean and follows existing patterns.

Recommendation: Approve with the minor suggestion to add actorId to the error log for better debuggability. Consider adding test coverage in a follow-up PR.


Code Quality: ⭐⭐⭐⭐ (4/5)
Impact: High (prevents potential memory leaks and cleanup failures)
Risk: Low (defensive error handling with clear fallback behavior)

@NathanFlurry NathanFlurry force-pushed the 11-03-chore_rivetkit_document_sleep_sequence_better branch from c919694 to 2d6a1e2 Compare November 4, 2025 00:36
@NathanFlurry NathanFlurry force-pushed the 11-03-fix_rivetkit_explicitly_handle_errors_in__onstop branch from 0bb481c to 158182f Compare November 4, 2025 00:36
@claude
Copy link

claude bot commented Nov 4, 2025

Pull Request Review

Summary

This PR adds error handling to the _onStop lifecycle method in the EngineActorDriver to prevent errors during actor shutdown from blocking actor cleanup. This is a defensive programming improvement that ensures actors are removed from the driver's actor map even if their _onStop method throws.


Code Quality ✅

Strengths:

  • Clean, focused change that addresses a specific failure scenario
  • Follows existing error logging patterns in the codebase
  • Uses stringifyError utility consistently (already imported on line 48)
  • Proper error logging with structured context
  • Maintains the critical cleanup operation (this.#actors.delete(actorId)) even when errors occur

Style Compliance:

  • ✅ Follows CLAUDE.md logging conventions: lowercase message, structured logging with err parameter
  • ✅ Import was already present at the top of the file (line 48), not added inline
  • ✅ Proper indentation with hard tabs matching the codebase

Potential Issues & Considerations

1. Error Logging Field Name ⚠️
The code logs the error as err: stringifyError(err), but based on the existing pattern in instance.ts:2036-2038, errors are typically logged with the field name error:

this.#rLog.error({
    msg: "error in onStop",
    error: stringifyError(error),
});

Recommendation: Consider changing to error: stringifyError(err) for consistency with the rest of the codebase, or verify if err vs error is an intentional distinction in this driver.

2. Actor State Inconsistency Risk 🤔
If _onStop throws partway through its execution (lines 2004-2085 in instance.ts), the actor could be in an inconsistent state:

  • Connections may be partially closed
  • State may not be fully saved (line 2064 calls saveState)
  • Background promises may still be pending
  • Write queues may not be drained

While this PR correctly proceeds with cleanup, there's a risk that resources associated with the actor aren't properly released. However, this is likely acceptable given the alternative is worse (actor never gets cleaned up).

3. Missing Test Coverage ⚠️
There don't appear to be tests covering the scenario where _onStop throws an error. Consider adding a test case:

  • Create an actor with an onStop hook that throws
  • Verify the actor is still removed from the driver
  • Verify appropriate error logging occurs
  • Verify subsequent operations work correctly

Performance Considerations ✅

No performance concerns. The try-catch adds negligible overhead and only executes during actor shutdown.


Security Considerations ✅

No security issues identified. The use of stringifyError properly sanitizes error information for logging.


Architecture & Design 👍

Positive:

  • This change aligns with the fail-safe pattern: prioritizing system stability over perfect cleanup
  • Logging the error provides observability while not blocking critical operations
  • The comment "proceeding with removing actor" clearly communicates the intentional behavior

Question:
Would it be valuable to track metrics on _onStop failures? This could help identify actors with problematic shutdown logic. Consider incrementing a counter or emitting a metric when this catch block executes.


Recommendations

High Priority:

  1. Add test coverage for _onStop errors in the driver test suite
  2. Consider standardizing error field name (err vs error)

Medium Priority:
3. Document in comments or docs what resources might leak if _onStop fails
4. Consider adding metrics/telemetry for _onStop failures

Low Priority:
5. Evaluate if there are other lifecycle methods that need similar defensive error handling


Verdict: ✅ Approve with Minor Suggestions

This is a solid defensive programming fix that improves system reliability. The core change is correct and valuable. The suggestions above are primarily about consistency, observability, and test coverage rather than correctness issues.

The change prevents a failure scenario where actors could get stuck in the driver if their cleanup logic throws, which could lead to memory leaks or other issues. Well done on identifying and fixing this edge case!

@claude
Copy link

claude bot commented Nov 4, 2025

PR Review: fix(rivetkit): explicitly handle errors in _onStop

Summary

This PR adds error handling around the actor._onStop() call in the EngineActorDriver to prevent errors during actor shutdown from blocking actor cleanup.

Code Quality & Best Practices ✅

Strengths:

  • Uses proper try-catch error handling pattern
  • Follows existing logging conventions with structured logging (logger().error({ msg, err }))
  • Uses stringifyError() utility consistent with codebase patterns
  • Error message is descriptive and indicates the system will proceed with cleanup
  • Correctly imports stringifyError from utils

Consistency with Codebase:

  • Follows the structured logging pattern seen elsewhere in the file (e.g., line 236 in canHibernatWebSocket error handling)
  • Uses lowercase message convention as per CLAUDE.md guidelines

Potential Issues & Recommendations

1. Inconsistent Error Handling Across Drivers ⚠️

The file-system driver at src/drivers/file-system/global-state.ts:328 does not have this error handling:

await actor.actor._onStop();

Recommendation: Apply the same error handling to the file-system driver for consistency and robustness. The error handling should be standardized across all driver implementations.

2. Consider Adding Telemetry/Metrics 💡

Since _onStop errors could indicate bugs in user code or lifecycle issues, consider:

  • Adding a metric/counter for _onStop failures
  • Including actorId in the error log for better debugging
  • Potentially including the actor's name and key if available

Example enhancement:

catch (err) {
    logger().error({
        msg: "error in _onStop, proceeding with removing actor",
        actorId,
        err: stringifyError(err),
    });
}

3. Internal Error Handling in _onStop 🔍

Looking at src/actor/instance.ts:2004-2064, the _onStop method already has internal error handling for the user's onStop hook (lines 2023-2041). However, it can still throw errors from:

  • State persistence operations (saveState at line 2064)
  • Connection disconnection promises (lines 2044-2050)
  • Background promise cleanup (lines 2052-2055)

This PR correctly handles these potential failure points at the driver level, which is good defense-in-depth.

Security Concerns ✅

No security issues identified:

  • Error logging uses stringifyError() which should sanitize sensitive data
  • No credential exposure risk
  • Proper cleanup ensures no resource leaks

Performance Considerations ✅

  • No performance impact - adds minimal overhead with try-catch
  • Ensures cleanup always proceeds, preventing resource leaks
  • Does not add unnecessary async operations

Test Coverage ⚠️

Gap Identified: No test coverage found for _onStop error scenarios.

Recommendations:

  1. Add a test case in src/driver-test-suite/tests/actor-error-handling.ts for onStop hook errors
  2. Test that the actor is properly removed from the driver's actor map even when _onStop throws
  3. Verify that subsequent operations on the same actor key work correctly after a failed stop

Example test structure:

test("should handle errors in onStop hook", async (c) => {
    const { client } = await setupDriverTest(c, driverTestConfig);
    const handle = client.errorInOnStopActor.getOrCreate();
    
    // Trigger stop (e.g., via destroy or timeout)
    // Verify actor is removed despite error
    // Verify new actor can be created with same key
});

Overall Assessment

Score: 8/10

This is a solid defensive programming fix that prevents actor shutdown errors from causing cascading failures. The implementation follows codebase conventions and is well-structured.

Action Items:

  1. Approve and merge this PR as-is (fixes a real issue)
  2. 🔄 Follow-up PR: Apply same fix to file-system driver
  3. 🧪 Follow-up PR: Add test coverage for onStop error scenarios
  4. 📊 Consider: Enhanced logging with actorId/context

Verdict:LGTM - Approve with recommendations for follow-up improvements.

@vercel
Copy link

vercel bot commented Nov 4, 2025

Deployment failed with the following error:

Too many requests - try again in 5 minutes (more than 120, code: "api-deployments-flood-pro").

Learn More: https://vercel.com/docs/limits#rate-limits

@claude
Copy link

claude bot commented Nov 4, 2025

Code Review - PR #3325

Summary

This PR adds error handling to the _onStop lifecycle method in the EngineActorDriver to ensure that errors during actor shutdown don't prevent proper cleanup. The change wraps the _onStop call in a try-catch block and logs errors while continuing with actor removal.

Positive Aspects ✅

  1. Critical Bug Fix: This properly handles exceptions in _onStop to ensure actors are removed from the internal map even if the lifecycle hook fails
  2. Consistent Error Handling: The pattern matches the existing error handling in ActorInstance._onStop (src/actor/instance.ts:2032-2040) where errors in the onStop hook are caught and logged
  3. Appropriate Error Logging: Uses stringifyError() utility which is the correct pattern used throughout the codebase for error serialization
  4. Correct Import: Properly imports stringifyError from @/utils

Architecture Considerations 🏗️

Lifecycle Flow Context:
The #runnerOnActorStop method in the driver is called when the engine runner signals that an actor should stop. It then calls actor._onStop() which internally:

  • Aborts any pending operations
  • Calls the user-defined onStop hook (with timeout and error handling)
  • Disconnects connections
  • Waits for background tasks
  • Saves state

Why This Fix Is Important:
If _onStop throws an error, the actor would never be removed from this.#actors map, causing a memory leak and potentially preventing proper cleanup on subsequent operations.

Suggestions for Consideration 💭

1. Error Context

Consider adding the actorId to the error log for better debugging:

logger().error({
    msg: "error in _onStop, proceeding with removing actor",
    actorId,
    err: stringifyError(err),
});

2. Test Coverage

No test files currently cover _onStop behavior. Consider adding a test case for this scenario in the driver test suite:

  • Test that actors are properly removed even when _onStop throws
  • Test that errors are logged appropriately
  • Verify no memory leaks occur

3. Metrics/Observability

Consider adding a metric or counter for _onStop failures to track how often this error path is hit in production.

Code Quality ✅

  • Style: Follows TypeScript conventions and existing patterns
  • Error Message: Clear and actionable ("proceeding with removing actor")
  • Defensive Programming: Good practice to ensure cleanup happens regardless of lifecycle hook failures

Security & Performance 🔒⚡

  • No Security Concerns: Error handling does not expose sensitive information
  • Performance: Minimal impact - only adds a try-catch block around existing code
  • Resource Management: Actually improves resource management by ensuring actors are removed from the map

Verdict ✅

LGTM with minor suggestions. This is a solid defensive programming fix that prevents potential memory leaks and ensures proper cleanup. The implementation is consistent with existing error handling patterns in the codebase.

The suggestions above are optional improvements and should not block merging this PR.


Review generated with attention to the repository's CLAUDE.md conventions

@claude
Copy link

claude bot commented Nov 4, 2025

Code Review: PR #3325 - Explicitly Handle Errors in _onStop

Summary

This PR adds defensive error handling around the actor._onStop() call in the EngineActorDriver to prevent exceptions during actor shutdown from blocking actor cleanup. This is a critical bug fix that prevents potential memory leaks.


✅ Positive Aspects

  1. Prevents Memory Leaks: Without this fix, if _onStop() throws an error, the actor would never be removed from this.#actors, causing a memory leak.

  2. Follows Existing Patterns: The error handling approach using stringifyError is consistent with other error handling in the codebase (e.g., similar patterns exist in instance.ts for lifecycle hooks).

  3. Appropriate Import Management: The stringifyError import is correctly added at the top of the file (line 48) following CLAUDE.md guidelines.

  4. Good Error Message: The log message clearly indicates the system is proceeding with cleanup despite the error.

  5. Minimal Risk: This is a defensive change that only affects error paths, with no impact on normal operation.


🔧 Recommendations for Improvement

1. Add actorId to Error Log (Minor)

Include the actorId in the error log for easier debugging when multiple actors are running:

logger().error({
    msg: "error in _onStop, proceeding with removing actor",
    actorId,  // Add this
    err: stringifyError(err),
});

2. Apply Same Fix to File System Driver (Important)

The file system driver at src/drivers/file-system/global-state.ts:328 has the same _onStop call without error handling:

await actor.actor._onStop();

Recommendation: Apply identical error handling to the file system driver for consistency.

3. Add Test Coverage (Follow-up)

Consider adding tests to verify:

  • Actors are properly removed even when _onStop throws
  • Error is logged appropriately
  • Subsequent operations work correctly after a failed _onStop

Test location: src/driver-test-suite/tests/actor-error-handling.ts or a new src/driver-test-suite/tests/actor-lifecycle-errors.ts

4. Consider Standardizing Error Field Name (Nitpick)

The catch block uses err while instance.ts:2036-2038 uses error. Both work, but error appears more common in the codebase.


🔍 Technical Analysis

Why This Is Needed:

Looking at instance.ts:2004-2085, the _onStop method performs critical cleanup:

  • Disconnects connections (lines 2044-2050)
  • Waits for background promises (lines 2052-2055)
  • Clears timeouts and intervals (lines 2058-2061)
  • Saves state (line 2064)
  • Drains write queues (lines 2081-2084)

While _onStop has internal error handling for the user's onStop hook (lines 2023-2041), it can still throw from:

  • State persistence operations
  • Connection cleanup promises
  • Queue draining operations

This PR correctly catches these errors at the driver level, ensuring actors are always removed from the driver's map.


🎯 Security & Performance

  • Security: No concerns. stringifyError properly sanitizes errors for logging.
  • Performance: Negligible overhead, only affects shutdown path.
  • Best Practices: Follows CLAUDE.md logging conventions (lowercase messages, structured logging).

📊 Verdict

LGTM ✅ - This is a well-implemented bug fix that improves system robustness.

Impact: High (prevents memory leaks and cleanup failures)
Risk: Low (defensive error handling)
Code Quality: ⭐⭐⭐⭐ (4/5)

Action Items

This PR:

  • ✅ Approve and merge

Follow-up PRs (suggested):

  1. Apply same fix to file system driver
  2. Add test coverage for _onStop error scenarios
  3. Add actorId to error log for better debugging

Great work on identifying and fixing this edge case! 🎉

@graphite-app
Copy link
Contributor

graphite-app bot commented Nov 4, 2025

Merge activity

  • Nov 4, 8:09 PM UTC: NathanFlurry added this pull request to the Graphite merge queue.
  • Nov 4, 8:11 PM UTC: CI is running for this pull request on a draft pull request (#3349) due to your merge queue CI optimization settings.
  • Nov 4, 8:41 PM UTC: The Graphite merge queue removed this pull request due to removal of a downstack PR #3342.
  • Nov 4, 8:52 PM UTC: NathanFlurry added this pull request to the Graphite merge queue.
  • Nov 4, 8:53 PM UTC: CI is running for this pull request on a draft pull request (#3353) due to your merge queue CI optimization settings.
  • Nov 4, 8:56 PM UTC: Merged by the Graphite merge queue via draft PR: #3353.

@claude
Copy link

claude bot commented Nov 4, 2025

PR Review: fix(rivetkit): explicitly handle errors in _onStop

Summary

This PR adds defensive error handling around the actor._onStop() call in EngineActorDriver.#runnerOnActorStop() to ensure actor cleanup proceeds even when the stop lifecycle hook throws an error.

Analysis

✅ Strengths

  1. Critical bug fix: Prevents errors in user-defined onStop hooks from blocking actor cleanup, which could lead to memory leaks (actors remaining in the #actors map indefinitely)

  2. Consistent with existing patterns:

    • Uses stringifyError() utility correctly (line 48:436)
    • Follows structured logging conventions from CLAUDE.md (lowercase message, structured fields)
    • Import already exists at top of file (line 48), not inline
  3. Good defensive programming: The error message clearly communicates the failure will be logged but cleanup will proceed - this is the correct behavior for a cleanup handler

  4. Aligns with ActorInstance design: The _onStop method in instance.ts:2004-2064 already has internal error handling for the user's onStop hook, but can still throw from state persistence, connection cleanup, or background promise handling - this adds an essential safety layer at the driver level

🔍 Observations \u0026 Suggestions

1. Minor: Include actorId in error log for debugging

Current code logs the error but not which actor failed:

logger().error({
    msg: "error in _onStop, proceeding with removing actor",
    err: stringifyError(err),
});

Consider adding actorId for better observability:

logger().error({
    msg: "error in _onStop, proceeding with removing actor",
    actorId,
    err: stringifyError(err),
});

2. Field naming: err vs error

The code uses err but looking at instance.ts:2036-2038, errors are typically logged as error:

this.#rLog.error({
    msg: "error in onStop",
    error: stringifyError(error),
});

Minor inconsistency - either is acceptable, but consider standardizing for easier log querying.

3. Similar pattern missing in file-system driver

The file-system driver at src/drivers/file-system/global-state.ts:328 lacks this error handling:

await actor.actor._onStop();

Consider applying the same defensive pattern there in a follow-up PR for consistency.

4. Test coverage gap

No tests found for the scenario where _onStop throws. Consider adding test coverage that:

  • Verifies actors are removed even when _onStop throws
  • Confirms subsequent operations work correctly
  • Validates error logging occurs

🔒 Security \u0026 Performance

  • ✅ No security concerns: stringifyError properly sanitizes error output
  • ✅ No performance impact: try-catch adds negligible overhead only during actor shutdown
  • ✅ Prevents resource leaks by ensuring cleanup always completes

Verdict: ✅ APPROVE

This is a well-implemented fix for a real reliability issue. The core change is correct and follows codebase conventions. The suggestions above are minor improvements for observability and consistency, not blockers.

Recommendation: Merge as-is, optionally address the actorId logging suggestion in a quick follow-up if desired.


Impact: High (prevents memory leaks and cleanup failures)
Risk: Low (defensive error handling with clear fallback)
Code Quality: 9/10

@graphite-app graphite-app bot closed this Nov 4, 2025
@graphite-app graphite-app bot deleted the 11-03-fix_rivetkit_explicitly_handle_errors_in__onstop branch November 4, 2025 20:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants