
Conversation

@NathanFlurry
Member

No description provided.

@vercel

vercel bot commented Nov 4, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project              Deployment  Preview  Comments  Updated (UTC)
rivetkit-serverless  Ready       Preview  Comment   Nov 5, 2025 4:05am

3 Skipped Deployments

Project          Deployment  Preview  Comments  Updated (UTC)
rivet-cloud      Ignored                        Nov 5, 2025 4:05am
rivet-inspector  Ignored     Preview            Nov 5, 2025 4:05am
rivet-site       Ignored     Preview            Nov 5, 2025 4:05am

Member Author

NathanFlurry commented Nov 4, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude

claude bot commented Nov 4, 2025

Pull Request Review

Summary

This PR adds environment variable configuration to reduce Pegboard actor retry timeouts specifically for development environments where runners may be terminated ungracefully (SIGKILL instead of SIGTERM).

Code Quality ✅

Strengths:

  • The comment documentation is excellent and clearly explains the problem being solved
  • The change is minimal and focused on a specific development pain point
  • The solution is appropriate for the use case (reducing timeouts in dev to account for frequent restarts)

Observations:

  1. Hardcoded Values: The values are hardcoded as strings ("100") rather than using constants or configuration

    • Consider defining these as constants at the top of the file for maintainability
    • Example:
    const DEV_PEGBOARD_RETRY_RESET_DURATION = "100";
    const DEV_PEGBOARD_BASE_RETRY_TIMEOUT = "100";
  2. Units Missing in Code: The comment mentions these are in milliseconds, but the values themselves don't make this clear

    • The Rust config (pegboard.rs:36,48) shows defaults of 2000ms and 600000ms respectively
    • Using 100ms is a 20x reduction for base_retry_timeout and 6000x reduction for retry_reset_duration
    • Consider adding inline comments about units: "100" // ms
  3. Type Safety: Environment variables are set as strings, which matches how they'll be consumed, but there's no validation

Potential Issues ⚠️

  1. Magic Numbers: The value "100" (100ms) seems quite aggressive

    • From pegboard.rs:36, the production default for base_retry_timeout is 2000ms
    • From pegboard.rs:48, the production default for retry_reset_duration is 600000ms (10 minutes)
    • Is 100ms sufficient for both? This might cause excessive retry attempts
    • Consider documenting why 100ms was chosen for both values
  2. Missing Configuration: Only 2 of the 4 Pegboard timeout configurations are set

    • RIVET__PEGBOARD__ACTOR_START_THRESHOLD and RIVET__PEGBOARD__ACTOR_STOP_THRESHOLD are not configured
    • These default to 30 seconds (pegboard.rs:40,44)
    • Should these also be reduced for development? The base branch name mentions all four configs
  3. Environment Detection: This applies to ALL engine processes spawned, not just development

    • There's no check for NODE_ENV or similar to ensure this only applies in dev
    • In production RivetKit usage, these aggressive timeouts could cause issues

Performance Considerations 📊

  1. Retry Frequency: With a 100ms base timeout, actors will retry very quickly after failures

    • This could increase load on the system during actual crashes
    • The exponential backoff (engine/packages/pegboard/src/workflows/actor/runtime.rs:619-625) helps, but the base is very low
  2. GC and Resource Cleanup: Rapid retries might not give sufficient time for proper cleanup between attempts

Security Concerns 🔒

No security concerns identified. This is a development-focused configuration change that only affects retry timing.

Test Coverage 🧪

Missing:

  • No tests for this new configuration
  • The existing test file (tests/engine-process.test.ts) only tests resolveTargetTripletFor
  • Consider adding a test to verify the environment variables are set correctly when spawning the engine process

Suggested Test:

it("sets development pegboard retry timeouts", () => {
  // Test that ensureEngineProcess sets the expected env vars
  // This would require refactoring to make spawn testable
});
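
A hedged sketch of the refactor this hints at, assuming Vitest as the test runner: extract the env construction into a pure helper so it can be asserted without spawning anything. buildEngineEnv is a hypothetical helper name, not an existing export.

import { expect, it } from "vitest";

// Hypothetical helper extracted from ensureEngineProcess so the env can be
// tested without touching child_process.spawn.
function buildEngineEnv(base: NodeJS.ProcessEnv): NodeJS.ProcessEnv {
  return {
    ...base,
    RIVET__PEGBOARD__RETRY_RESET_DURATION: "100",
    RIVET__PEGBOARD__BASE_RETRY_TIMEOUT: "100",
  };
}

it("sets development pegboard retry timeouts", () => {
  const env = buildEngineEnv({ PATH: "/usr/bin" });
  expect(env.RIVET__PEGBOARD__RETRY_RESET_DURATION).toBe("100");
  expect(env.RIVET__PEGBOARD__BASE_RETRY_TIMEOUT).toBe("100");
});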

Recommendations 📝

  1. Add environment detection to only apply these settings in development:

    env: {
      ...process.env,
      ...(process.env.NODE_ENV === 'development' && {
        RIVET__PEGBOARD__RETRY_RESET_DURATION: "100",
        RIVET__PEGBOARD__BASE_RETRY_TIMEOUT: "100",
      }),
    }
  2. Consider making these configurable via RivetKit options rather than hardcoded (see the sketch after this list)

  3. Review the timeout values - perhaps 100ms is too aggressive even for development?

    • Consider 500ms or 1000ms as a middle ground
    • Document the reasoning for the chosen values
  4. Add the missing configurations mentioned in the base branch name if they're also needed for the dev experience

  5. Add basic test coverage to verify the configuration is applied
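
A hedged illustration of what recommendation 2 could look like; EnginePegboardOptions and the option names are hypothetical, not RivetKit's current API.

// Hypothetical RivetKit-level options; names are illustrative only.
interface EnginePegboardOptions {
  retryResetDurationMs?: number;
  baseRetryTimeoutMs?: number;
}

function pegboardEnv(opts: EnginePegboardOptions = {}): Record<string, string> {
  return {
    RIVET__PEGBOARD__RETRY_RESET_DURATION: String(opts.retryResetDurationMs ?? 100),
    RIVET__PEGBOARD__BASE_RETRY_TIMEOUT: String(opts.baseRetryTimeoutMs ?? 100),
  };
}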

Verdict

This is a sensible quality-of-life improvement for development, but would benefit from:

  • Environment detection to ensure it only applies in dev
  • Review of the timeout values (100ms may be too aggressive)
  • Test coverage
  • Constants instead of magic strings

The code is well-documented and addresses a real pain point. With the above improvements, this would be ready to merge.

Approve with suggestions

@claude

claude bot commented Nov 5, 2025

Code Review - PR #3355

Summary

This PR reduces Pegboard actor retry timeouts for spawned engine processes in development to prevent actors from becoming unresponsive when frequently killing dev servers.

Findings

1. Configuration Values - Potential Issue

The PR sets both environment variables to "100" milliseconds:

RIVET__PEGBOARD__RETRY_RESET_DURATION: "100",
RIVET__PEGBOARD__BASE_RETRY_TIMEOUT: "100",

Default production values (from engine/packages/config/src/config/pegboard.rs:36,48):

  • base_retry_timeout: 2000ms (2 seconds)
  • retry_reset_duration: 600000ms (10 minutes)

Concerns:

  • 100ms is extremely aggressive - This is 20x faster than production for base_retry_timeout and 6000x faster for retry_reset_duration
  • The backoff algorithm uses exponential backoff with base 8 (see runtime.rs:819): Backoff::new_at(8, None, base_retry_timeout, 500, retry_count)
  • With base_retry_timeout=100ms, retries will happen at: 100ms, 800ms, 6.4s, 51.2s, etc. (see the sketch after this list)
  • With retry_reset_duration=100ms, the retry counter resets after only 100ms of successful operation, meaning a single hiccup could restart the backoff cycle
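
To make that progression concrete, here is a minimal sketch that reproduces the delay sequence quoted above; the delay = base * 8^retryCount formula is inferred from the reviewer's numbers, and the real Backoff implementation may cap or jitter these values.

// Reproduces the quoted retry delay progression under different base timeouts.
function retryDelaysMs(baseRetryTimeoutMs: number, attempts: number): number[] {
  return Array.from({ length: attempts }, (_, retryCount) => baseRetryTimeoutMs * 8 ** retryCount);
}

console.log(retryDelaysMs(100, 4));  // [ 100, 800, 6400, 51200 ]        dev override
console.log(retryDelaysMs(2000, 4)); // [ 2000, 16000, 128000, 1024000 ] production default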

Recommendations:

  • Consider more conservative values like:
    • RETRY_RESET_DURATION: "5000" (5 seconds) - still 120x faster than production
    • BASE_RETRY_TIMEOUT: "500" (500ms) - still 4x faster than production
  • These would still be dev-friendly while preventing potential thrashing

2. Code Quality - Good

Positive aspects:

  • Excellent inline documentation explaining the problem and rationale
  • Comment clearly describes the development vs production behavior difference
  • The approach of using environment variables is appropriate for development-specific configuration

3. Potential Issues

⚠️ Missing considerations:

  • No validation that this only runs in development environments
  • The code unconditionally sets these values for all spawned engines
  • If RivetKit is used in production-like scenarios, these aggressive timeouts could cause issues

Suggestion: Add a check to only apply these settings in development:

env: {
	...process.env,
	...(process.env.NODE_ENV === 'development' && {
		RIVET__PEGBOARD__RETRY_RESET_DURATION: "5000",
		RIVET__PEGBOARD__BASE_RETRY_TIMEOUT: "500",
	}),
},

4. Testing Considerations

Questions:

  • Have you tested these values work as expected when killing the dev server?
  • Have you verified the retry behavior with these extremely low timeouts?
  • Consider testing:
    • Rapid server restarts (< 100ms apart)
    • Multiple concurrent actor failures
    • Memory/CPU impact of frequent retries

5. Documentation

✅ The inline comments are excellent and explain the "why" clearly.

Summary

The change addresses a real development pain point with a reasonable approach. However, the timeout values chosen (100ms) are extremely aggressive and may cause unintended side effects. Consider:

  1. Immediate fix: Increase timeouts to more conservative development values (500-5000ms range)
  2. Better fix: Add environment check to only apply in development
  3. Best fix: Both of the above, plus expose these as configurable options

Let me know if you'd like me to elaborate on any of these points!

@NathanFlurry force-pushed the 11-04-fix_rivetkit_reduce_pegboard_actor_retry_for_spawned_engines branch from b064b66 to 68aeeee on November 5, 2025 02:07
@NathanFlurry force-pushed the 11-04-chore_pegboard_allow_configuring_base_retry_timeout_actor_start_threshold_actor_stop_threshold_and_retry_reset_duration branch from 91b48b3 to 1527e67 on November 5, 2025 02:07
@claude

claude bot commented Nov 5, 2025

PR Review: Reduce pegboard actor retry for spawned engines

Summary

This PR adds environment variable overrides for Pegboard retry configuration when spawning the Rivet Engine in development mode. The change addresses a development-specific issue where frequently killing dev servers causes actors to become unresponsive due to production-oriented retry backoff logic.

Code Quality ✅

Strengths:

  • Clear, comprehensive documentation explaining the problem and solution
  • Follows existing patterns in the codebase for environment variable configuration
  • Comment explains the "why" effectively (development vs production behavior)

Observations:

  • The values are hardcoded as strings ("100") which is acceptable for this dev-specific override
  • The configuration correctly uses the RIVET__PEGBOARD__ prefix, which maps to the config structure via the config crate's environment variable parsing (lines 40-42 in engine/packages/config/src/lib.rs)

Configuration Values 🔍

The PR sets both retry parameters to 100ms:

  • RIVET__PEGBOARD__RETRY_RESET_DURATION: "100" (default: 600,000ms / 10 minutes)
  • RIVET__PEGBOARD__BASE_RETRY_TIMEOUT: "100" (default: 2,000ms)

Analysis:

  • Default retry_reset_duration is 10 minutes (rivet/engine/packages/config/src/config/pegboard.rs:48)
  • Default base_retry_timeout is 2 seconds (rivet/engine/packages/config/src/config/pegboard.rs:36)
  • The 100ms values provide near-instant retry in development, which aligns with the stated goal
  • These values are used in the exponential backoff calculation (rivet/engine/packages/pegboard/src/workflows/actor/runtime.rs:818-820)

Potential Considerations 💭

  1. Value Hardcoding: Consider whether these values should be:

    • Configurable via RivetKit options
    • Defined as named constants for maintainability
    • Current approach is fine for a dev-only fix, but future enhancement could make this configurable
  2. Environment Detection: The comment states this is for "development" but the code doesn't check if it's actually running in dev mode. This means:

    • These values will always be set when using ensureEngineProcess
    • If this function is used in production contexts, it could affect production behavior
    • Question: Is ensureEngineProcess exclusively used in development contexts?
  3. Units Clarity: The values are in milliseconds but passed as strings. While the config system handles this correctly, consider whether adding a comment about units would be helpful (though the field names already reference duration/timeout).

Security ✅

No security concerns. This is a development experience improvement that reduces retry delays.

Performance ✅

Performance impact is intentional and beneficial for development:

  • Faster actor rescheduling in dev eliminates frustrating delays
  • No negative production impact if this code path is dev-only

Test Coverage ⚠️

Recommendations:

  • Consider adding a test to verify these environment variables are correctly set when spawning the engine
  • Manual testing should verify:
    • Actors reschedule quickly after SIGKILL in dev
    • No unintended side effects from the reduced timeouts

Summary

This is a well-reasoned fix for a legitimate development pain point. The implementation is clean and well-documented.

Recommendation: Approve

Optional follow-up improvements (not blocking):

  1. Add constants for the timeout values with descriptive names
  2. Verify/document that ensureEngineProcess is dev-only
  3. Consider making these configurable via RivetKit options in the future

Great work on identifying and fixing this developer experience issue!

@claude

claude bot commented Nov 5, 2025

Code Review

Overview

This PR adds environment variable configuration to reduce Pegboard actor retry timeouts when spawning engine processes in development. The change addresses a specific development workflow issue where runners terminated without graceful shutdown (SIGKILL) trigger production backoff logic, making actors unresponsive during frequent dev server restarts.

Positive Aspects

  • Good problem identification: The comment clearly explains the issue being solved
  • Appropriate solution: Using environment variables is the right approach for configuration
  • Development-only change: This only affects spawned engine processes in RivetKit, not production

Issues & Recommendations

1. Magic Numbers Without Context ⚠️

The values 100 (milliseconds) are used without explanation:

RIVET__PEGBOARD__RETRY_RESET_DURATION: "100",
RIVET__PEGBOARD__BASE_RETRY_TIMEOUT: "100",

Default values for reference:

  • RETRY_RESET_DURATION: default is 10 minutes (600,000 ms) → changed to 100 ms
  • BASE_RETRY_TIMEOUT: default is 2 seconds (2,000 ms) → changed to 100 ms

Recommendation: Add a comment explaining why 100ms was chosen, or define these as named constants at the file level:

// Development-specific timeout values to enable rapid iteration.
// Production defaults: RETRY_RESET_DURATION=600000ms, BASE_RETRY_TIMEOUT=2000ms
const DEV_RETRY_RESET_DURATION = "100";
const DEV_BASE_RETRY_TIMEOUT = "100";

2. Unit Consistency ⚠️

The values are documented as milliseconds in the Rust config (pegboard.rs:9,27), but the string values "100" might benefit from clarity.

Recommendation: Add a comment clarifying the unit:

// Values in milliseconds
RIVET__PEGBOARD__RETRY_RESET_DURATION: "100",
RIVET__PEGBOARD__BASE_RETRY_TIMEOUT: "100",

3. Production Safety ✅ (but worth verifying)

These environment variables only affect the spawned engine process, not the parent RivetKit process. However, it's worth verifying:

Question: Is there any scenario where RivetKit might be used in production with runEngine: true? If so, we may need to conditionally set these based on NODE_ENV.

Recommendation: Consider adding a guard:

env: {
    ...process.env,
    ...(process.env.NODE_ENV !== 'production' && {
        RIVET__PEGBOARD__RETRY_RESET_DURATION: "100",
        RIVET__PEGBOARD__BASE_RETRY_TIMEOUT: "100",
    }),
},

4. Test Coverage ⚠️

No tests were added for this change. While the existing test file (engine-process.test.ts) only tests resolveTargetTripletFor, it would be valuable to verify the environment variables are correctly set.

Recommendation: Consider adding a test that verifies these environment variables are present when spawning the engine process (can be done in a follow-up PR).

5. Documentation 📝

The inline comment is good, but this behavior should potentially be documented in RivetKit docs or README.

Recommendation: Document this behavior where RivetKit's development mode is explained.

Security Considerations

✅ No security concerns - these are timeout configuration values

Performance Considerations

Positive impact: Reduces development friction by allowing faster recovery from crash loops
⚠️ Potential concern: If accidentally used in production, could cause excessive retry attempts under load

Code Quality

  • Follows existing code style ✅
  • Comment quality is good ✅
  • Variable naming follows Rust config conventions ✅

Summary

The change is sound and addresses a real development pain point. The main improvements would be:

  1. Extract magic numbers to named constants
  2. Add production safety guard
  3. Consider test coverage in follow-up

Recommendation: Approve with minor suggestions

The code is functional and safe for development use. The suggestions above are for improved maintainability and production safety, but are not blockers.

@NathanFlurry force-pushed the 11-04-chore_pegboard_allow_configuring_base_retry_timeout_actor_start_threshold_actor_stop_threshold_and_retry_reset_duration branch from 1527e67 to 5ed524d on November 5, 2025 02:18
@NathanFlurry force-pushed the 11-04-fix_rivetkit_reduce_pegboard_actor_retry_for_spawned_engines branch from 68aeeee to 4a3505a on November 5, 2025 02:18
@claude

claude bot commented Nov 5, 2025

PR Review: fix(rivetkit): reduce pegboard actor retry for spawned engines

Summary

This PR adds development-specific environment variable overrides to reduce retry timeouts for Pegboard actors when spawning Rivet Engine processes through RivetKit. This addresses a development workflow issue where frequent dev server restarts cause actors to become unresponsive due to production-oriented backoff strategies.

Code Quality ✅

Strengths:

  • Clear, comprehensive comment explaining the rationale (lines 88-97)
  • Follows the codebase's existing patterns for environment configuration
  • Minimal, focused change that addresses a specific development pain point

Observations:

  • The environment variables are set to very low values ("100" ms) compared to production defaults (2000 ms for base_retry_timeout, 10 * 60 * 1000 = 600,000 ms for retry_reset_duration as seen in engine/packages/config/src/config/pegboard.rs:36,48)
  • These values represent a 20x reduction for base_retry_timeout and 6000x reduction for retry_reset_duration

Potential Issues ⚠️

  1. Type Safety / String Values: The values are passed as strings ("100"), but looking at the config file (engine/packages/config/src/config/pegboard.rs:12,31), these should be parsed as usize and i64 respectively. This should work fine as Rust's config parsing will handle the string-to-number conversion, but it's worth verifying this works correctly.

  2. Magic Numbers: The value 100 ms is hardcoded without clear justification. Consider:

    • Is 100ms the right balance between responsiveness and avoiding excessive retries?
    • Should these be configurable via RivetKit's own config?
    • Adding a comment explaining why specifically 100ms was chosen
  3. Development vs Production Distinction: While the comment mentions "in development," there's no explicit check to ensure these overrides only apply in development environments. Consider:

    • Should this check NODE_ENV or another environment variable?
    • Could this accidentally affect production builds?

Performance Considerations 🚀

Positive:

  • Significantly improves development iteration speed by reducing wait times from minutes to milliseconds
  • Only affects spawned engine processes, not the RivetKit process itself

Potential Concerns:

  • Setting retry timeouts to 100ms could cause rapid retry loops if there are actual issues
  • Could mask real problems during development that would surface in production
  • The extremely low retry_reset_duration (100ms vs 10 minutes) means actors will reset their backoff almost immediately, potentially causing resource thrashing

Security Concerns 🔒

No security concerns identified. This change only affects development workflow timing parameters.

Test Coverage 📋

Missing:

  • No tests verify these environment variables are properly passed to the spawned process
  • No tests verify the behavior changes when these timeouts are reduced
  • No integration tests for the development scenario described (frequent SIGKILL)

Recommendation: Consider adding a test that:

  1. Spawns an engine process
  2. Verifies the environment variables are set correctly
  3. (Optional) Verifies the retry behavior actually changes

Suggestions for Improvement

  1. Add constants for the timeout values:

    const DEV_PEGBOARD_RETRY_RESET_DURATION_MS = 100;
    const DEV_PEGBOARD_BASE_RETRY_TIMEOUT_MS = 100;
  2. Consider environment-aware configuration:

    const isDevelopment = process.env.NODE_ENV === 'development';
    env: {
      ...process.env,
      ...(isDevelopment ? {
        RIVET__PEGBOARD__RETRY_RESET_DURATION: "100",
        RIVET__PEGBOARD__BASE_RETRY_TIMEOUT: "100",
      } : {}),
    }
  3. Extend the comment to explain the specific timeout values chosen:

    // We reduce the timeouts for resetting a runner as healthy in
    // order to account for this:
    // - retry_reset_duration: 100ms (vs 10min production) - reset backoff quickly
    // - base_retry_timeout: 100ms (vs 2s production) - minimal initial retry delay
  4. Document this behavior in development setup documentation so developers understand why actors recover quickly in dev but not in production.

Verdict: ✅ Approve with Suggestions

This is a pragmatic fix for a real development workflow issue. The code quality is good, the rationale is clear, and the change is appropriately scoped. However, the extremely aggressive timeout reduction (especially 6000x for retry_reset_duration) may warrant some tuning or at least clearer documentation about the trade-offs.

The suggestions above would make this more maintainable and explicit, but they're not blockers for merging.

@claude

claude bot commented Nov 5, 2025

Pull Request Review

Summary

This PR adds development-specific configuration for Pegboard actor retry behavior when spawning Rivet Engine processes from RivetKit. It reduces retry timeouts from production defaults to improve the development experience.

Code Quality & Best Practices

Positives:

  • ✅ Excellent inline documentation explaining the rationale for the change
  • ✅ Clear explanation of the development vs production behavior difference
  • ✅ Follows TypeScript conventions for environment variable configuration
  • ✅ Non-invasive change that only affects development environment

Suggestions:

  1. Magic Numbers (Minor): The values "100" are hardcoded. Consider defining them as constants for better maintainability:

    const DEV_RETRY_RESET_DURATION_MS = "100";
    const DEV_BASE_RETRY_TIMEOUT_MS = "100";

    This would make it easier to adjust these values in the future and provide semantic meaning.

  2. Type Safety (Minor): The environment variables are set as strings. While this matches the spawn API requirements, adding a comment about the expected unit (milliseconds) would be helpful for future maintainers.

Potential Issues

  1. Production Environment Detection (Medium): The PR description mentions "in development", but the code doesn't actually check if it's running in a development environment. These reduced timeouts will be applied in all environments where ensureEngineProcess is called.

    Consider adding an environment check:

    env: {
        ...process.env,
        ...(process.env.NODE_ENV === 'development' && {
            RIVET__PEGBOARD__RETRY_RESET_DURATION: "100",
            RIVET__PEGBOARD__BASE_RETRY_TIMEOUT: "100",
        }),
    }
  2. Value Selection Rationale: The values are set to 100ms compared to production defaults of:

    • base_retry_timeout: 2000ms (default)
    • retry_reset_duration: 600000ms / 10 minutes (default)

    Setting RETRY_RESET_DURATION to 100ms (from 10 minutes) is an extremely aggressive reduction. This means the retry count resets almost immediately. Was this intentional? Based on the code in runtime.rs:731, this controls how long before retry counts reset to treat the runner as "healthy" again.
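
An illustrative sketch of how such a reset window typically behaves (a generic model, not Rivet's actual pegboard code): if enough time has passed since the last failure, the retry count goes back to zero; otherwise it keeps growing.

// Illustrative only: a generic retry-reset window, not the pegboard implementation.
function nextRetryCount(prevCount: number, msSinceLastFailure: number, retryResetDurationMs: number): number {
  return msSinceLastFailure >= retryResetDurationMs ? 0 : prevCount + 1;
}

// With retryResetDurationMs = 100, almost any gap between failures resets the
// counter, so the backoff effectively restarts from the base timeout each time.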

Performance Considerations

  • Impact: The 100ms timeouts mean actors will retry almost immediately with minimal backoff, which could lead to rapid retry loops if there are actual issues
  • Trade-off: This is acceptable for development where quick feedback is valued over stability, but the lack of environment detection means this could affect non-dev deployments if RivetKit is used there

Security Concerns

  • ✅ No security issues identified
  • The change only affects retry timing and doesn't expose sensitive data or create vulnerabilities

Test Coverage

Missing Tests:

  • No tests verify the environment variables are set correctly
  • No tests validate the behavior with these reduced timeouts
  • The existing test file (tests/engine-process.test.ts) only covers resolveTargetTripletFor

Recommendations:

  1. Add a test to verify the spawn configuration includes these environment variables
  2. Consider adding integration tests that validate the faster retry behavior in development

Documentation

Suggestions:

  1. Update RivetKit documentation to mention these development-specific configurations
  2. Consider documenting the expected behavior when developers frequently restart their servers
  3. The inline comment is excellent, but consider also adding this to the CLAUDE.md or a development guide

Questions for the Author

  1. Why was 100ms chosen for both values? Was this based on testing or an arbitrary choice?
  2. Should there be a way for developers to opt-out of this behavior if they want to test production-like retry behavior locally?
  3. Have you considered making these values configurable through RivetKit options?

Summary

This is a pragmatic fix for a real development pain point. The code quality is good with clear documentation. However, the lack of environment detection and the extremely aggressive timeout reduction are concerns that should be addressed before merging.

Recommendation: Request changes to add environment detection or make this behavior explicit and configurable.


Review generated by Claude Code

graphite-app bot force-pushed the 11-04-chore_pegboard_allow_configuring_base_retry_timeout_actor_start_threshold_actor_stop_threshold_and_retry_reset_duration branch from 5ed524d to 6bc2a41 on November 5, 2025 02:31
graphite-app bot force-pushed the 11-04-fix_rivetkit_reduce_pegboard_actor_retry_for_spawned_engines branch from 4a3505a to 6a2bd5c on November 5, 2025 02:32
@claude

claude bot commented Nov 5, 2025

Code Review

Summary

This PR reduces the pegboard actor retry timeouts for spawned engines in development mode to improve the developer experience when frequently restarting dev servers. The change addresses a pain point where actors become unresponsive due to aggressive backoff when dev servers are killed ungracefully.

Positive Aspects ✅

  1. Clear Problem Statement: The inline comments excellently explain the rationale - distinguishing between production crash loops and development server restarts.

  2. Scoped Impact: The change only affects the RivetKit TypeScript engine process spawning, not production configurations. This is appropriate for a development-focused improvement.

  3. Well-Documented: The 8-line comment clearly explains why this change is needed, making future maintenance easier.

Issues & Concerns

1. Magic Numbers Without Context ⚠️

The values are set to "100" (100ms) which is:

  • 6,000x smaller than the production RETRY_RESET_DURATION default (10 minutes = 600,000ms)
  • 20x smaller than the production BASE_RETRY_TIMEOUT default (2,000ms)

Questions:

  • Why specifically 100ms? Was this empirically tested?
  • Should these values be constants with descriptive names instead of magic numbers?
  • Would slightly higher values (e.g., 250ms or 500ms) still solve the problem while being less aggressive?

Recommendation: Consider adding a comment explaining the 100ms choice, or extract to named constants:

const DEV_RETRY_RESET_DURATION_MS = "100"; // Fast reset for dev server restarts
const DEV_BASE_RETRY_TIMEOUT_MS = "100";    // Minimal backoff in development

2. Type Inconsistency ⚠️

The environment variables are set as strings ("100") but the Rust config expects numeric types:

  • base_retry_timeout: Option<usize> (engine/packages/config/src/config/pegboard.rs:12)
  • retry_reset_duration: Option<i64> (engine/packages/config/src/config/pegboard.rs:31)

This works because environment variables are inherently strings and likely parsed by the config system, but the code would be clearer with:

RIVET__PEGBOARD__RETRY_RESET_DURATION: "100",  // Current - works but unclear
// vs
RIVET__PEGBOARD__RETRY_RESET_DURATION: String(100),  // Explicit string conversion

3. Potential Production Leakage Risk ⚠️

While this code only runs in RivetKit's ensureEngineProcess, there's no explicit check that this is development mode. If someone uses ensureEngineProcess in production (unlikely but possible), they'd get development timeouts.

Recommendation: Consider adding a safety check:

env: {
  ...process.env,
  // Only apply dev overrides in non-production
  ...(process.env.NODE_ENV !== 'production' && {
    RIVET__PEGBOARD__RETRY_RESET_DURATION: "100",
    RIVET__PEGBOARD__BASE_RETRY_TIMEOUT: "100",
  }),
}

4. Testing Gap 📋

No tests verify this behavior. Consider:

  • How would we test that ungraceful shutdowns don't cause excessive backoff?
  • Should there be integration tests for the retry behavior?

Performance Considerations

Performance impact is minimal - these are only configuration values set at spawn time.

⚠️ Potential concern: With 100ms retries, a legitimately failing actor in development could create a tight retry loop. However, given the backoff implementation in reschedule_backoff (engine/packages/pegboard/src/workflows/actor/runtime.rs:818-820), this should still have exponential backoff even with the lower base.

Security Considerations

No security concerns - this is a development-focused configuration change with no exposed attack surface.

Best Practices & Code Quality

According to CLAUDE.md:

  • ✅ Uses structured logging patterns
  • ✅ Follows existing code style
  • ⚠️ Hard-coded values instead of constants (minor deviation from maintainability best practices)

TypeScript Practices:

  • ✅ Properly spreads existing env vars
  • ✅ Clear inline documentation
  • ⚠️ Could benefit from extracting magic numbers

Recommendations Summary

High Priority:

  1. Add a comment explaining why 100ms was chosen
  2. Consider extracting to named constants

Medium Priority:

  3. Add safety check for production environment
  4. Consider slightly higher values if 100ms is too aggressive

Low Priority:

  5. Add integration tests for retry behavior in development

Verdict

Approved with suggestions

This is a practical fix for a real development pain point. The changes are isolated, well-documented, and low-risk. The suggestions above would make the code more maintainable, but the current implementation is acceptable for merging.


Review performed using repository CLAUDE.md guidelines

@NathanFlurry force-pushed the 11-04-fix_rivetkit_reduce_pegboard_actor_retry_for_spawned_engines branch from 6a2bd5c to 87467c8 on November 5, 2025 04:04
@NathanFlurry force-pushed the 11-04-chore_pegboard_allow_configuring_base_retry_timeout_actor_start_threshold_actor_stop_threshold_and_retry_reset_duration branch from 6bc2a41 to 3e43c22 on November 5, 2025 04:04
@claude

claude bot commented Nov 5, 2025

PR Review: Reduce Pegboard Actor Retry for Spawned Engines

Summary

This PR reduces retry timeouts for Pegboard actors in development environments to improve developer experience when frequently restarting dev servers.

Code Quality ✅

Strengths:

  • Well-documented rationale with clear inline comments explaining the problem and solution
  • Follows existing code patterns in the file
  • Comments are properly formatted and informative

Minor Suggestions:

  1. Configuration values lack units: The values "100" are set without explicit units in the variable names or comments. Based on the Rust code analysis, these appear to be milliseconds, but this should be explicit:
    // Current
    RIVET__PEGBOARD__RETRY_RESET_DURATION: "100",
    RIVET__PEGBOARD__BASE_RETRY_TIMEOUT: "100",
    
    // Suggested comment addition
    // Set to 100ms (default: 10 minutes and 2 seconds respectively)
    RIVET__PEGBOARD__RETRY_RESET_DURATION: "100",
    RIVET__PEGBOARD__BASE_RETRY_TIMEOUT: "100",

Potential Issues ⚠️

  1. Production vs Development safety: These environment variables are being set unconditionally when spawning the engine process. The comment says "In development" but there's no check to ensure this only applies to development environments. Consider:

    • Is ensureEngineProcess only called in development contexts?
    • Should there be an explicit check or environment variable to gate this behavior?
  2. Extremely aggressive timeouts: 100ms is very short compared to the defaults (10 minutes for RETRY_RESET_DURATION, 2 seconds for BASE_RETRY_TIMEOUT). While appropriate for development:

    • This could cause issues if an engine legitimately takes longer to start
    • Could mask real crash loops during development
    • Consider if these values were chosen empirically or if they could be slightly higher (e.g., 500ms-1000ms)
  3. No validation: The values are hardcoded strings with no validation. The Rust side expects usize and i64 types respectively. While Rust will handle parsing, consider:

    • What happens if parsing fails?
    • Should these be constants at the top of the file for easier maintenance?

Performance Considerations ⚡

Positive Impact:

  • Significantly improves development iteration speed by reducing wait times from minutes to milliseconds
  • Prevents actors from becoming unresponsive during rapid dev server restarts

Potential Concerns:

  • Very short retry timeouts (100ms) might cause unnecessary retries if the engine takes longer to become healthy
  • The waitForEngineHealth() function already has a 10-second timeout with 100ms intervals - ensure these retry values don't conflict with that logic
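
For reference, a rough sketch of the polling pattern described above (10 s budget, 100 ms interval); the health URL, port, and the exact waitForEngineHealth signature are assumptions, not the actual implementation.

// Rough approximation of a health wait loop; the URL below is an assumption.
async function waitForEngineHealth(url = "http://127.0.0.1:6420/health"): Promise<void> {
  const deadline = Date.now() + 10_000;
  while (Date.now() < deadline) {
    try {
      const res = await fetch(url);
      if (res.ok) return;
    } catch {
      // engine not listening yet; retry after the interval
    }
    await new Promise((resolve) => setTimeout(resolve, 100));
  }
  throw new Error("engine did not become healthy within 10s");
}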

Security Concerns 🔒

No security concerns identified. This is a development-focused configuration change that:

  • Only affects local engine spawning
  • Doesn't expose sensitive data
  • Doesn't modify authentication or access control

Test Coverage 📋

Current State:

  • No tests exist for this new configuration
  • Existing tests in engine-process.test.ts only cover target triplet resolution

Recommendations:

  1. Add a test to verify the environment variables are set when spawning the engine:

    it("sets pegboard retry configuration for development", () => {
      // Test that env vars include RIVET__PEGBOARD__ configs
    });
  2. Consider integration tests to verify the behavior works as expected with rapid restarts

Architecture Alignment 🏗️

Follows CLAUDE.md conventions:

  • ✅ Proper commit message format: fix(rivetkit): reduce pegboard actor retry for spawned engines
  • ✅ Uses lowercase log messages (not applicable here, but generally followed)
  • ⚠️ Missing test coverage (CLAUDE.md emphasizes testing)

Recommendations Summary

Must Address:

  • Clarify if this should only apply in development environments and add appropriate guards if needed

Should Consider:

  • Add inline comment about the 100ms unit and default values for context
  • Extract magic numbers to named constants
  • Evaluate if 100ms is too aggressive (consider 500ms-1000ms)
  • Add basic test coverage for environment variable setting

Nice to Have:

  • Document this configuration in RivetKit development docs
  • Add a way to override these values via environment variables for local customization

Overall Assessment

This is a sensible fix for a real development pain point. The implementation is clean and well-documented. With minor improvements around safety checks and test coverage, this will be a solid contribution to developer experience.

Recommendation: Approve with minor changes suggested


Review generated by Claude Code

@claude

claude bot commented Nov 5, 2025

PR Review: Reduce Pegboard Actor Retry for Spawned Engines

Summary

This PR reduces the retry timeouts for Pegboard actors when running in development mode to avoid actors becoming unresponsive when frequently restarting the dev server.


Positive Aspects ✅

  1. Clear Documentation: The inline comments excellently explain the problem (dev servers getting SIGKILL vs SIGTERM) and why the change is needed.

  2. Appropriate Scope: The change is minimal and targeted - only affecting development environments spawned by RivetKit.

  3. Solves Real Dev Pain: Addresses a legitimate UX issue where actors become unresponsive during active development.


Issues & Concerns 🔍

Critical: Extremely Aggressive Timeout Values

The values set (100 milliseconds) are extremely aggressive and may cause issues:

RIVET__PEGBOARD__RETRY_RESET_DURATION: "100",
RIVET__PEGBOARD__BASE_RETRY_TIMEOUT: "100",

Context from the Rust codebase:

  • base_retry_timeout: Default is 2000ms (2 seconds) - packages/config/src/config/pegboard.rs:36
  • retry_reset_duration: Default is 600000ms (10 minutes) - packages/config/src/config/pegboard.rs:48

Your changes reduce these by:

  • base_retry_timeout: 95% reduction (2000ms → 100ms)
  • retry_reset_duration: 99.98% reduction (10 min → 100ms)

Potential Problems:

  1. Retry Reset Duration (100ms): This controls how long the system waits before resetting the retry count back to 0. At 100ms, virtually every retry will reset the counter, defeating the exponential backoff mechanism entirely. This could cause:

    • Infinite retry loops without backoff
    • Excessive resource consumption on actual failures
    • Difficulty distinguishing between temporary and permanent failures
  2. Base Retry Timeout (100ms): While reducing the initial retry delay makes sense for dev, 100ms might be too aggressive. Consider:

    • Cold start times for spawning new engine processes
    • Network latency and health check overhead
    • Risk of overwhelming the system with rapid retries

Recommendations

Option 1: More Conservative Values (Recommended)

RIVET__PEGBOARD__RETRY_RESET_DURATION: "30000",  // 30 seconds
RIVET__PEGBOARD__BASE_RETRY_TIMEOUT: "500",      // 500ms

This still provides fast dev iteration (30s vs 10min reset) while maintaining some backoff protection.

Option 2: Add Configuration Comments

If you've tested these values and they work well, add comments explaining:

  • Why 100ms was chosen
  • What testing validated these values
  • Any trade-offs accepted for dev convenience

Additional Observations

Code Quality

  • ✅ Follows existing patterns in the file
  • ✅ Environment variables properly prefixed with RIVET__PEGBOARD__
  • ✅ Values passed as strings (correct for env vars)
  • ✅ Config system will properly deserialize to numeric types (packages/config/src/lib.rs:40-42)

Missing Considerations

  1. No Test Coverage: Should there be tests validating these dev-mode behaviors?
  2. Production Safety: Good that this only affects dev spawned engines, but consider documenting how to ensure these values are never used in production deployments
  3. Observability: With such aggressive retries, are there logs/metrics to help devs understand retry behavior?

Questions for Author

  1. Have you tested actor behavior with these specific timeout values during rapid dev server restarts?
  2. Did you observe any issues with actors retrying too quickly or not backing off properly?
  3. Would it be valuable to make these configurable (e.g., via RivetKit config) rather than hardcoded?

Verdict

Request Changes ⚠️

The approach is sound, but the timeout values appear too aggressive and may cause unintended issues. I recommend:

  1. Increase RIVET__PEGBOARD__RETRY_RESET_DURATION to at least 30 seconds (preferably 60s)
  2. Consider increasing RIVET__PEGBOARD__BASE_RETRY_TIMEOUT to 500ms or 1000ms
  3. Add a comment explaining the chosen values and any testing performed

Once the timeout values are adjusted to be less aggressive while still solving the dev UX issue, this will be a solid improvement.

@graphite-app
Contributor

graphite-app bot commented Nov 5, 2025

Merge activity

  • Nov 5, 5:07 AM UTC: NathanFlurry added this pull request to the Graphite merge queue.
  • Nov 5, 5:08 AM UTC: CI is running for this pull request on a draft pull request (#3371) due to your merge queue CI optimization settings.
  • Nov 5, 5:11 AM UTC: Merged by the Graphite merge queue via draft PR: #3371.

graphite-app bot closed this Nov 5, 2025
graphite-app bot deleted the 11-04-fix_rivetkit_reduce_pegboard_actor_retry_for_spawned_engines branch on November 5, 2025 05:11