
Conversation

@MasterPtato
Contributor

No description provided.

@vercel

vercel bot commented Nov 4, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Preview Comments Updated (UTC)
rivetkit-serverless Ready Ready Preview Comment Nov 5, 2025 2:32am
3 Skipped Deployments
Project Deployment Preview Comments Updated (UTC)
rivet-cloud Ignored Ignored Nov 5, 2025 2:32am
rivet-inspector Ignored Ignored Preview Nov 5, 2025 2:32am
rivet-site Ignored Ignored Preview Nov 5, 2025 2:32am

Contributor Author

MasterPtato commented Nov 4, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude

claude bot commented Nov 4, 2025

PR Review: Add reschedule_ts to actors

Summary

This PR adds a new reschedule_ts timestamp field to track when an actor will attempt to allocate again after rescheduling. This provides better visibility into actor scheduling state and helps prevent premature allocation attempts during backoff periods.


Code Quality & Best Practices

Positive observations:

  • ✅ Good use of structured logging patterns (e.g., tracing::debug!(?actor_id, "rescheduling actor"))
  • ✅ Proper use of #[serde(default)] attribute for backward compatibility with existing state
  • ✅ Consistent timestamp naming convention (reschedule_ts) following the established pattern
  • ✅ Documentation added to OpenAPI spec and type definitions
  • ✅ Timestamp ordering in struct fields is logical and grouped appropriately

Areas for improvement:

  1. Backoff refactoring in backoff.rs:88

    let next_wait = self.current_duration() + rand::thread_rng().gen_range(0..self.randomness);

    This change extracts self.wait * 2usize.pow(self.i.min(self.max_exponent) as u32) into current_duration(), which is cleaner. However, note that this changes behavior slightly - the old code calculated the base duration fresh each time, while now it uses the extracted method. Verify this is intentional and doesn't affect existing backoff behavior in other parts of the codebase.

  2. Minor variable naming issue in gasoline/src/ctx/standalone.rs:100

    -		Ok(new_ctx)
    +		Ok(ctx)

    Good catch! This appears to fix a bug where new_ctx variable was referenced but never defined. The code creates ctx and should return ctx.

  3. Missing documentation in workflow state
    The new reschedule_ts field in workflows/actor/mod.rs:62 has #[serde(default)] but lacks a doc comment explaining its purpose, unlike other timestamp fields in the Actor type. Consider adding:

    /// When the actor will next attempt allocation (during backoff rescheduling)
    #[serde(default)]
    pub reschedule_ts: Option<i64>,

Potential Bugs or Issues

  1. Race condition consideration in compare_retry activity (runtime.rs:724-739)
    The reschedule_ts is set in the compare_retry activity based on the current retry count. However, this activity is called BEFORE the retry count is incremented in reschedule_actor (line 632-636). This means:

    • Activity calculates: reschedule_ts = now + backoff(retry_count)
    • Then reschedule_state.retry_count is incremented

    Is this the intended behavior? If the reschedule_ts should reflect the NEXT retry attempt, the calculation should use retry_count + 1. Please verify this logic is correct.

  2. Inconsistent state update timing
    The reschedule_ts is set in the compare_retry activity (runtime.rs:735), but it's never explicitly cleared when an actor successfully allocates. Consider adding explicit clearing in:

    • allocate_actor activity when allocation succeeds
    • update_runner activity when externally allocated
    • set_started activity when actor becomes connectable
  3. Type conversion safety
    In runtime.rs:735:

    state.reschedule_ts = Some(now + i64::try_from(backoff.current_duration())?);

    The try_from can theoretically fail if current_duration() returns a usize that doesn't fit in i64. While unlikely in practice (backoff durations are small), consider whether this should return an error or use saturating conversion.
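
If saturating behavior is preferred over propagating an error, a drop-in sketch of that alternative (variable names mirror the snippet above; this is illustrative, not the PR's code):

// Saturate instead of erroring if the duration cannot fit in i64 or the sum would overflow.
let duration_ms = i64::try_from(backoff.current_duration()).unwrap_or(i64::MAX);
state.reschedule_ts = Some(now.saturating_add(duration_ms));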


Performance Considerations

  • ✅ No performance concerns identified
  • ✅ The new field adds minimal overhead (8 bytes per actor state)
  • ✅ Database operations remain efficient (no additional queries)
  • ✅ The reschedule_backoff helper function (runtime.rs:813) is appropriately extracted to avoid duplication

Security Concerns

  • ✅ No security issues identified
  • ✅ The timestamp is server-controlled and not user-supplied
  • ✅ No data exposure concerns (timestamp is internal scheduling metadata)

Test Coverage

Observation: While the changes are relatively straightforward, consider adding test coverage for:

  1. Reschedule timestamp behavior

    • Verify reschedule_ts is set correctly during backoff
    • Verify reschedule_ts is cleared on reset
    • Verify reschedule_ts reflects the correct backoff calculation
  2. API response validation

    • Verify the new field appears correctly in OpenAPI responses
    • Test that reschedule_ts is null for actors not in rescheduling state
  3. Backward compatibility

    • Verify old workflow states without reschedule_ts deserialize correctly with the #[serde(default)] attribute

Suggested test locations:

  • engine/packages/engine/tests/actors_lifecycle.rs - for lifecycle timestamp tests
  • engine/packages/pegboard/src/workflows/actor/runtime.rs - unit tests for reschedule_backoff and timestamp calculations
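
A minimal sketch of the backward-compatibility test (the struct and JSON below are illustrative stand-ins for the real workflow state, assuming serde and serde_json are available as dev-dependencies):

use serde::Deserialize;

// Illustrative stand-in for the actor workflow state; only the new field matters here.
#[derive(Deserialize)]
struct ActorState {
    create_ts: i64,
    /// When the actor will next attempt allocation (during backoff rescheduling).
    #[serde(default)]
    reschedule_ts: Option<i64>,
}

#[test]
fn old_state_without_reschedule_ts_deserializes() {
    // State serialized before the field existed.
    let old = r#"{ "create_ts": 1730764800000 }"#;
    let state: ActorState = serde_json::from_str(old).expect("old state should deserialize");
    assert_eq!(state.create_ts, 1730764800000);
    assert_eq!(state.reschedule_ts, None);
}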

Additional Recommendations

  1. Consider adding metrics
    Since this tracks rescheduling behavior, consider adding a metric to track:

    • Number of actors currently in rescheduling state
    • Average backoff duration
    • Reschedule success rate
  2. Database schema migration
    Verify that the addition of reschedule_ts to the workflow state doesn't require any database migrations or cleanup of existing records.

  3. Timestamp consistency ordering
    In actor/get.rs:90-95, the timestamps are reordered. While this improves readability, ensure this ordering is documented somewhere (perhaps in CLAUDE.md) as a convention for timestamp fields.


Summary

Overall, this is a solid PR that adds useful observability into actor rescheduling behavior. The main concerns are:

  1. Verify the retry count vs reschedule_ts calculation timing is correct
  2. Consider explicitly clearing reschedule_ts when actors successfully allocate
  3. Add test coverage for the new field

The code follows Rivet conventions well and the implementation is clean. Nice work! 🎉

@claude

claude bot commented Nov 5, 2025

PR Review: Add reschedule_ts to actors

Thanks for this PR! I've reviewed the changes and have some feedback below.

Summary

This PR adds a reschedule_ts field to track when an actor will attempt allocation again after a rescheduling event. The changes also include:

  • Documentation improvements for timestamp fields in the Actor type
  • A bug fix in gasoline standalone context
  • Refactoring backoff calculation logic

Code Quality & Best Practices

Positive:

  • Good use of #[serde(default)] for backward compatibility on the new reschedule_ts field
  • Excellent documentation additions to OpenAPI schema and types - the timestamp field descriptions are very helpful
  • Refactoring backoff logic into a helper function (reschedule_backoff) reduces code duplication
  • Proper handling of state updates in the compare_retry activity

Issues:

  1. Inconsistent field ordering - In engine/packages/pegboard/src/ops/actor/get.rs:87-94, the field order was changed which could affect readability:
// Before: create_ts, pending_allocation_ts, start_ts, sleep_ts, connectable_ts, destroy_ts
// After:  create_ts, start_ts, pending_allocation_ts, connectable_ts, sleep_ts, reschedule_ts, destroy_ts

While this new ordering is more chronological (create → start → pending → connectable → sleep → reschedule → destroy), this change seems unrelated to the main feature. Consider keeping this in a separate refactoring PR or document the rationale in the PR description.

  2. Missing context in compare_retry - The activity now has side effects on state.reschedule_ts, but this isn't reflected in the activity name. Consider renaming to something like compare_and_update_retry or adding a comment explaining the state mutation.

Potential Bugs

Critical:

  1. Variable shadowing/unused variable in engine/packages/gasoline/src/ctx/standalone.rs:97:
let mut ctx = StandaloneCtx::new(...)?;
ctx.from_workflow = true;
Ok(ctx)  // This returns ctx, not new_ctx

The original code had Ok(new_ctx) which would fail since new_ctx doesn't exist. Good catch on the fix! However, this appears to be an unrelated bug fix that should ideally be in a separate PR or at minimum mentioned in the PR description.

Medium:

  1. Potential integer overflow in engine/packages/pegboard/src/workflows/actor/runtime.rs:735:
state.reschedule_ts = Some(now + i64::try_from(backoff.current_duration())?);

While try_from will catch usize → i64 overflow, there's no check for now + duration overflow. Given that timestamps are in milliseconds, this is unlikely in practice but worth noting. Consider using checked_add for safety:

state.reschedule_ts = Some(
    now.checked_add(i64::try_from(backoff.current_duration())?)
        .ok_or_else(|| anyhow!("timestamp overflow"))?
);

Minor:

  1. Unused mut in reschedule_actor at line 622:
let mut backoff = reschedule_backoff(state.reschedule_state.retry_count);

The backoff variable is only used to call .step() once, but it's declared as mut. While .step() does mutate the backoff, the result is discarded. This is likely intentional but could be clarified with a comment or by removing mut if mutation isn't needed.


Performance Considerations

Good:

  • The refactored backoff calculation is more efficient by avoiding duplicate backoff object creation
  • Setting reschedule_ts in the database state happens within an existing activity, avoiding extra round trips

Neutral:

  • Adding a new field to the state doesn't materially impact performance
  • The current_duration() method in backoff.rs is efficient (simple exponential calculation)

Minor concern:

  • The compare_retry activity is now doing more work (setting reschedule_ts), but this is negligible

Security Concerns

No security issues identified. The changes are internal state management with no user input handling or privilege escalation paths.


Test Coverage

Missing:

  • No tests for the new reschedule_ts functionality
  • No tests verifying that reschedule_ts is correctly set/cleared during retry reset
  • No tests for the start_ts idempotency change (line 750-752)

Recommendations:

  1. Add unit tests for the compare_retry activity covering:
    • reschedule_ts is set correctly when not resetting
    • reschedule_ts is cleared when resetting
    • Backoff calculation is correct
  2. Add integration tests for actor rescheduling that verify reschedule_ts in the database
  3. Test the start_ts idempotency: verify that an actor started multiple times only sets start_ts once

Additional Observations

  1. Documentation in OpenAPI - Great addition of descriptions for all timestamp fields! This will help API consumers understand the actor lifecycle.

  2. start_ts idempotency - The change at runtime.rs:750-752 makes start_ts only set once:

if state.start_ts.is_none() {
    state.start_ts = Some(util::timestamp::now());
}

This is a subtle behavioral change. Previously, start_ts would update on every set_started call. Now it truly represents "first started time". This seems correct based on the field documentation ("Denotes when the actor was first made connectable"), but it's worth calling out in the PR description.

  3. Backoff refactoring - The change in backoff.rs:88 is good:
// Before: let next_wait = self.wait * 2usize.pow(...) + randomness
// After:  let next_wait = self.current_duration() + randomness

This ensures consistent calculation between current_duration() and step().
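
A minimal sketch of the refactor being described, reconstructed from the snippets quoted in this thread (field and method names mirror those snippets and may not match the real Backoff type exactly):

use rand::Rng;

struct Backoff {
    i: usize,            // current iteration
    wait: usize,         // base wait in ms
    max_exponent: usize, // cap on the exponent
    randomness: usize,   // jitter upper bound in ms
}

impl Backoff {
    // Exponential duration for the current iteration, without jitter.
    fn current_duration(&self) -> usize {
        self.wait * 2usize.pow(self.i.min(self.max_exponent) as u32)
    }

    // step() now builds on current_duration(), so the two calculations cannot drift apart.
    fn step(&mut self) -> usize {
        let next_wait =
            self.current_duration() + rand::thread_rng().gen_range(0..self.randomness);
        self.i += 1;
        next_wait
    }
}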


Recommendations

Must fix:

  • Add tests for reschedule_ts functionality

Should fix:

  • Handle potential timestamp overflow in reschedule_ts calculation
  • Add PR description explaining the unrelated fixes (gasoline context, field ordering, start_ts idempotency)

Nice to have:

  • Consider renaming compare_retry to reflect its side effects
  • Add inline comment explaining why backoff is mutable in reschedule_actor

Overall, this is a solid PR that adds useful observability for actor rescheduling. The main gaps are test coverage and some minor clarity issues. Great work on the documentation improvements!

@claude

claude bot commented Nov 5, 2025

PR Review: feat(pb): add reschedule_ts to actors

Summary

This PR adds a reschedule_ts field to actor state to track when an actor will attempt to allocate again after a failure. The PR also includes improvements to the backoff logic and fixes a bug in the gasoline standalone context.


Code Quality & Best Practices

Positive:

  • Good use of structured documentation comments following the project's conventions (e.g., "Denotes when..." pattern for timestamp fields in packages/types/src/actors.rs:18-31)
  • Proper use of #[serde(default)] attribute for backward compatibility on new fields (packages/pegboard/src/workflows/actor/runtime.rs:718, packages/pegboard/src/workflows/actor/mod.rs:61)
  • Consistent timestamp naming following the *_ts convention used in this codebase
  • Good extraction of backoff logic into a reusable function reschedule_backoff() at packages/pegboard/src/workflows/actor/runtime.rs:813-815

Issues:

  1. Bug in standalone.rs:97 - CRITICAL

    Ok(new_ctx)  // Changed to: Ok(ctx)

    The variable new_ctx is never defined in this function. This appears to be a copy-paste error from line 83 where the correct variable is ctx. However, looking at the context, this is actually being fixed in this PR - good catch!

  2. Inconsistent field ordering in get.rs:90-94
    The field order was changed, but there's no clear reason for the reordering. The new order groups timestamps logically (create → start → pending_allocation → connectable → sleep → reschedule → destroy), which is an improvement.


Potential Bugs & Issues

  1. Race condition in compare_retry activity (runtime.rs:724-739)

    The reschedule_ts is being set in the compare_retry activity based on the current retry count, but there's a potential race if the actor is destroyed or modified between when this activity runs and when the state is persisted. However, since this uses the workflow state mutation pattern, this should be safe.

  2. Integer overflow potential (runtime.rs:735)

    state.reschedule_ts = Some(now + i64::try_from(backoff.current_duration())?);

    While try_from will catch conversion errors, there's still potential for overflow when adding to now. Consider checking if the addition would overflow:

    let duration_ms = i64::try_from(backoff.current_duration())?;
    state.reschedule_ts = Some(now.checked_add(duration_ms)
        .ok_or_else(|| anyhow!("timestamp overflow calculating reschedule_ts"))?);
  3. Backoff calculation in backoff.rs:88

    The refactoring extracts current_duration() for reuse, which is good. However, the original calculation was:

    self.wait * 2usize.pow(self.i.min(self.max_exponent) as u32) + randomness

    The new version calls current_duration() which doesn't include randomness, then adds randomness. This is correct, but the duplication of the randomness addition logic in both tick() and step() could lead to inconsistencies if changed later. Consider extracting the full calculation including randomness.

  4. Missing validation of reschedule_ts consistency

    When reschedule_ts is set, there's no validation that it's in the future or that it's reasonable. While the backoff logic should ensure this, defensive programming suggests adding a sanity check.
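
A lightweight version of that check could be a debug assertion right after the field is set (a sketch; exact placement and condition are assumptions, not the PR's code):

// Defensive check: the computed reschedule time should never land in the past.
if let Some(ts) = state.reschedule_ts {
    debug_assert!(ts >= now, "reschedule_ts should not be in the past");
}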


Performance Considerations

  1. OpenAPI schema size - The additions to openapi.json are minimal and well-structured. No performance concerns.

  2. Database writes - The reschedule_ts field is added to the actor state, which is already being written during the compare_retry activity. No additional database operations are introduced.

  3. Backoff calculation caching - The reschedule_backoff() function creates a new Backoff instance each time it's called. Since it's called multiple times in reschedule_actor() and compare_retry(), consider if the backoff could be calculated once and reused. However, given the function is lightweight, this is a minor concern.


Security Concerns

No security issues identified. The changes are internal state management and don't expose new attack surfaces.


Test Coverage

CRITICAL ISSUE: No tests included

This PR modifies critical actor scheduling logic but includes no tests. The following should be tested:

  1. Unit tests needed:

    • reschedule_ts is correctly calculated based on retry count
    • reschedule_ts is cleared when retry count is reset
    • Backoff calculation is correct for various retry counts
    • Edge cases: retry count = 0, maximum retry count, overflow scenarios
  2. Integration tests needed:

    • Actor reschedule flow with the new timestamp
    • Verify reschedule_ts appears in API responses (get.rs, list_for_ns.rs)
    • Verify behavior when reschedule_ts is in the future vs past
  3. Regression tests needed:

    • Ensure backward compatibility with actors that don't have reschedule_ts set (the #[serde(default)] should handle this, but it should be tested)

Recommendation: Add at least basic unit tests for the backoff calculation and integration tests for the reschedule flow before merging.


Additional Observations

  1. Documentation in OpenAPI - Excellent addition of descriptions for all timestamp fields. This will greatly help API consumers understand the actor lifecycle.

  2. API contract change - Adding reschedule_ts to the Actor type is a backward-compatible addition since it's optional. Existing clients will simply ignore the new field.

  3. Naming consistency - The PR consistently uses reschedule_ts throughout, following the existing pattern. Good adherence to conventions.

  4. Missing changelog/migration notes - Consider documenting:

    • What reschedule_ts represents for operators/users
    • Any implications for monitoring or debugging

Recommendations

Before merging:

  1. ✅ Fix addressed: The standalone.rs bug fix looks correct
  2. ⚠️ Add tests - This is critical for such important scheduling logic
  3. ⚠️ Add overflow protection for timestamp calculation
  4. 📝 Consider adding a code comment explaining why reschedule_ts is set in compare_retry rather than in reschedule_actor

Nice to have:

  • Add integration test showing the full reschedule flow
  • Document the reschedule backoff behavior in a comment near BASE_RETRY_TIMEOUT_MS
  • Consider adding metrics/logging for when actors are in the rescheduled state

Verdict

The PR is well-structured and follows project conventions, but it lacks test coverage for critical scheduling logic. The bug fix in standalone.rs is correct. I recommend:

  1. Request changes - Add basic test coverage before merging
  2. Consider the integer overflow protection
  3. After tests are added, this should be safe to merge

Great work on the documentation additions and the clean implementation of the state tracking!

@claude

claude bot commented Nov 5, 2025

Code Review for PR #3363: feat(pb): add reschedule_ts to actors

Overview

This PR adds a reschedule_ts field to actors to track when an actor will attempt to allocate again after a failed allocation attempt. This provides better visibility into actor scheduling state and retry timing.


Positive Aspects ✅

  1. Good Documentation: The new field has clear documentation explaining its purpose in both the OpenAPI spec and the Rust types
  2. Backward Compatibility: Uses #[serde(default)] attribute to ensure backward compatibility with existing state
  3. Clean Refactoring: The reschedule_backoff helper function is properly extracted and reused
  4. Consistent Timestamp Naming: Follows the existing *_ts convention for timestamp fields
  5. Proper Integration: Field is properly propagated through all relevant code paths (get, list operations)
  6. Improved OpenAPI Docs: Added helpful descriptions to multiple timestamp fields, not just the new one

Issues & Concerns 🔴

1. Critical: Potential Integer Overflow (runtime.rs:735)

state.reschedule_ts = Some(now + i64::try_from(backoff.current_duration())?);

Issue: While try_from handles conversion overflow, the addition itself could overflow if now is very large and current_duration() is near the i64 max.

Recommendation: Use checked arithmetic:

state.reschedule_ts = Some(
    now.checked_add(i64::try_from(backoff.current_duration())?)
        .ok_or_else(|| anyhow::anyhow!("reschedule_ts overflow"))?
);

2. Bug: Incorrect Backoff Calculation (backoff.rs:88)

let next_wait = self.current_duration() + rand::thread_rng().gen_range(0..self.randomness);

Issue: The refactored code in step() now adds randomness AFTER calling current_duration(), but the randomness should be added to the NEXT wait calculation, not reused from the previous one. The old code calculated the next wait correctly as:

let next_wait = self.wait * 2usize.pow(self.i.min(self.max_exponent) as u32) + randomness

But now it uses current_duration() which calculates based on the CURRENT self.i, not the next one. This is a subtle behavior change.

Impact: The backoff timing will be incorrect - it will use the backoff for iteration i instead of iteration i+1.

Recommendation: Revert this change or ensure current_duration() is called after incrementing i.

3. Inconsistent Timestamp Clearing (runtime.rs:732)

if reset {
    state.reschedule_ts = None;
}

Issue: When the retry count is reset, reschedule_ts is cleared, but this happens in compare_retry activity BEFORE the actual reschedule attempt. If reschedule fails again immediately, the field won't be set until the next retry.

Recommendation: Consider clearing this field when an actor successfully allocates, not when resetting retry count. This would provide more accurate state.

4. Missing Field Clearing

The reschedule_ts field is only set in compare_retry but is never explicitly cleared when:

  • An actor successfully allocates
  • An actor is destroyed
  • An actor enters sleep mode

Recommendation: Add explicit clearing in appropriate activities:

  • In allocate_actor when allocation succeeds (runtime.rs:366-374)
  • In set_started when actor becomes connectable (runtime.rs:747-769)
  • In set_sleeping (runtime.rs:777-799)
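
For example, the clearing in set_started could sit next to the existing idempotent start_ts logic (a sketch based on the set_started snippet quoted elsewhere in this thread; the surrounding activity body is assumed):

// Inside the set_started activity.
if state.start_ts.is_none() {
    state.start_ts = Some(util::timestamp::now());
}
// The actor is connectable again, so it is no longer waiting out a reschedule backoff.
state.reschedule_ts = None;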

5. Versioning Missing

The compare_retry activity has been modified with a new input field:

#[serde(default)]
retry_count: usize,

While #[serde(default)] handles backward compatibility, the activity is marked as v(2) but there's no indication this was a breaking change or what happened to v1.

Recommendation: Add a comment explaining the version change and what differs from v1.


Code Quality Observations 📋

Good:

  1. Follows CLAUDE.md conventions: Lowercase log messages, structured tracing
  2. Proper error handling: Uses anyhow Result types consistently
  3. Good state management: Uses workflow state properly

Could Improve:

  1. Test Coverage: No tests were added for this new functionality. Consider adding tests that verify:
    • reschedule_ts is set correctly when rescheduling
    • reschedule_ts is cleared when appropriate
    • The backoff calculation produces expected values
  2. Documentation: While the field is documented, the overall rescheduling logic could benefit from comments explaining when/why this timestamp is set

Performance Considerations ⚡

  • Negligible impact: Adding a single optional timestamp field has minimal performance overhead
  • Serialization cost: The #[serde(default)] attribute is appropriate and efficient

Security Concerns 🔒

  • No security issues identified: The changes are internal state tracking only
  • Timestamp values are computed internally, not user-provided

Recommendations Summary

Must Fix (before merge):

  1. Fix the backoff calculation bug in backoff.rs:88 - this changes behavior
  2. Add checked arithmetic for potential overflow in runtime.rs:735

Should Fix:
3. Clear reschedule_ts when actor successfully allocates/starts
4. Add tests for the new field and rescheduling behavior

Nice to Have:
5. Add comments explaining the v2 activity change
6. Consider adding logs when reschedule_ts is set to help with debugging


Verdict

The PR adds useful observability for actor scheduling, but has two critical bugs that need to be addressed:

  1. The backoff calculation change
  2. Missing overflow protection

Please address these issues before merging. Once fixed, this will be a solid improvement to actor state tracking.

@claude

claude bot commented Nov 5, 2025

Code Review

Summary

This PR adds a reschedule_ts field to track when an actor will attempt to allocate again after a failure. This provides better observability into the actor rescheduling backoff mechanism.

Positive Aspects

Good documentation: The OpenAPI schema and type definitions include clear descriptions of what reschedule_ts represents
Backward compatibility: Uses #[serde(default)] for the new field, ensuring compatibility with existing workflow state
Consistent naming: Follows the established *_ts convention for timestamps
Helper function extraction: The reschedule_backoff() helper improves code reusability

Issues & Concerns

1. Critical Bug: Incorrect backoff calculation in set_started

// engine/packages/pegboard/src/workflows/actor/runtime.rs:750-752
if state.start_ts.is_none() {
    state.start_ts = Some(util::timestamp::now());
}

Issue: The set_started activity now only sets start_ts on the first start, but this changes existing behavior. Previously, start_ts was always updated to the current timestamp on every start.

Impact:

  • start_ts now represents "first start time" rather than "last start time"
  • This changes the semantics from the docstring which says "when the actor was first made connectable"
  • Could break monitoring/metrics that depend on start_ts being updated

Recommendation: This appears to be an intentional change to make start_ts represent the first start time, which aligns with the new documentation. However, this behavioral change should be:

  1. Explicitly documented in the PR description
  2. Verified that no existing code depends on the old behavior
  3. Consider if a migration is needed for existing actors

2. Potential Logic Issue: reschedule_ts calculation timing

// engine/packages/pegboard/src/workflows/actor/runtime.rs:731-736
if reset {
    state.reschedule_ts = None;
} else {
    let backoff = reschedule_backoff(input.retry_count);
    state.reschedule_ts = Some(now + i64::try_from(backoff.current_duration())?);
}

Issue: The reschedule_ts is calculated in compare_retry but the actual sleep happens in reschedule_actor after the activity returns. The sequence is:

  1. compare_retry calculates reschedule_ts = now + backoff.current_duration()
  2. Then in reschedule_actor, backoff.step() is called which adds randomness and increments the counter
  3. The actual sleep time may not match the stored reschedule_ts

Impact: reschedule_ts may be slightly inaccurate (off by the randomness factor of 0-500ms) and doesn't account for the time between the activity and the actual sleep.

Recommendation: Either:

  • Calculate reschedule_ts using backoff.step() instead of current_duration() (but handle the side effects)
  • Document that reschedule_ts is approximate
  • Or recalculate it after the actual sleep time is determined

3. Missing update in workflow state initialization

The new field is added to State::new() initialization but should verify all paths that create or modify state include proper handling of this field.

4. Test Coverage

Missing: No tests were added for the new reschedule_ts field behavior.

Recommendation: Add tests to verify:

  • reschedule_ts is set correctly during rescheduling
  • reschedule_ts is cleared on retry reset
  • reschedule_ts is included in API responses
  • The field handles migration from existing actors without the field

Minor Issues

5. Backoff refactoring in util/backoff.rs

// Before:
let next_wait = self.wait * 2usize.pow(self.i.min(self.max_exponent) as u32)
    + rand::thread_rng().gen_range(0..self.randomness);

// After:
let next_wait = self.current_duration() + rand::thread_rng().gen_range(0..self.randomness);

Good: Reduces code duplication by reusing current_duration()
Note: This is a refactoring improvement but unrelated to the main PR purpose

Performance Considerations

✅ No significant performance impact - just adds one field to the state

Security Considerations

✅ No security concerns - timestamp is informational only

Recommendations

Must Fix:

  1. Document the start_ts behavioral change and verify no breaking changes

Should Fix:
2. Add test coverage for the new field
3. Verify or document that reschedule_ts is approximate
4. Consider adding the field to any logging statements about rescheduling for better debugging

Nice to Have:
5. Add the reschedule_ts to tracing logs when rescheduling occurs:

tracing::debug!(?reschedule_ts, "rescheduling actor");

Overall Assessment

The PR adds useful observability, but has a subtle behavioral change to start_ts that needs verification and documentation. The reschedule_ts calculation has a minor timing issue that should be documented or addressed.

Status: ⚠️ Needs revision - primarily to document/verify the start_ts change and add tests

@graphite-app graphite-app bot changed the base branch from 11-04-fix_fix_all_dashboards to graphite-base/3363 November 5, 2025 02:29
@graphite-app graphite-app bot force-pushed the graphite-base/3363 branch from c934952 to d3a8681 Compare November 5, 2025 02:30
@graphite-app graphite-app bot force-pushed the 11-04-feat_pb_add_reschedule_ts_to_actors branch from 7547022 to 4d3c3c9 Compare November 5, 2025 02:30
@graphite-app graphite-app bot changed the base branch from graphite-base/3363 to main November 5, 2025 02:31
@graphite-app graphite-app bot force-pushed the 11-04-feat_pb_add_reschedule_ts_to_actors branch from 4d3c3c9 to b8fface Compare November 5, 2025 02:31
@claude

claude bot commented Nov 5, 2025

PR Review: Add reschedule_ts to actors

Summary

This PR adds a reschedule_ts field to track when an actor will attempt to allocate again after rescheduling failures. The field provides visibility into the backoff scheduling state.

Code Quality & Best Practices

Strengths

  1. Good documentation: Added clear descriptions to OpenAPI schema and the Actor type explaining the purpose of reschedule_ts
  2. Backwards compatibility: Used #[serde(default)] on new field to handle existing serialized state gracefully
  3. Consistent naming: Follows the established *_ts convention for timestamps
  4. Appropriate data type: Uses i64 for epoch timestamp in milliseconds, consistent with other timestamp fields

🔍 Code Quality Issues

1. Backoff calculation refactoring (packages/util/src/backoff.rs:88)

The refactor extracts current_duration() calculation, which is good for DRY, but there's a subtle behavior change:

// Old (line 88-89):
let next_wait = self.wait * 2usize.pow(self.i.min(self.max_exponent) as u32)
    + rand::thread_rng().gen_range(0..self.randomness);

// New (line 88):
let next_wait = self.current_duration() + rand::thread_rng().gen_range(0..self.randomness);

Note: The old code added randomness to the calculated duration and the new code does the same, so behavior is unchanged - good refactor! ✓

2. State mutation timing in compare_retry activity (packages/pegboard/src/workflows/actor/runtime.rs:731-736)

The compare_retry activity now mutates state.reschedule_ts based on whether the retry should reset:

if reset {
    state.reschedule_ts = None;
} else {
    let backoff = reschedule_backoff(input.retry_count);
    state.reschedule_ts = Some(now + i64::try_from(backoff.current_duration())?);
}

Concern: This calculation happens during the activity but uses input.retry_count which is the current retry count, not the next retry count. Looking at the caller (line 622-637), retry_count is incremented after this activity returns. This means reschedule_ts may be off by one backoff level.

Recommendation: Either:

  • Calculate reschedule_ts after incrementing retry count in the workflow
  • Pass the next retry count to the activity
  • Document if this is intentional (calculating for current attempt vs next attempt)

3. start_ts only set on first start (packages/pegboard/src/workflows/actor/runtime.rs:750-752)

if state.start_ts.is_none() {
    state.start_ts = Some(util::timestamp::now());
}

Question: This changes behavior so start_ts is only set once. The doc comment says "when the actor was first made connectable" but previously this would update on every set_started call. Is this intentional?

If yes, great! If not, this is a behavioral change that should be highlighted.

Potential Bugs

None identified - The code appears functionally correct, though see timing concern above.

Performance Considerations

✅ No concerns - The changes are minimal:

  • One additional optional field in serialization
  • Simple timestamp calculation with no blocking operations
  • Backoff refactoring maintains same complexity

Security Concerns

✅ No concerns - The reschedule_ts field is:

  • Read-only metadata (no user input)
  • Internal scheduling information
  • Properly validated through type system (i64)

Test Coverage

⚠️ Missing Tests

No tests were added for:

  1. reschedule_ts is correctly set during rescheduling
  2. reschedule_ts is cleared on reset
  3. reschedule_ts correctly reflects the backoff duration
  4. Serialization/deserialization with missing field (backwards compat)

Recommendation: Add tests to verify the reschedule_ts field behavior, especially:

  • Value is set during reschedule with proper backoff calculation
  • Value is cleared when retry count resets
  • Proper handling of #[serde(default)] for existing workflows

Additional Observations

  1. API exposure: The field is exposed in OpenAPI, which is good for observability. Consider if this should be used in any API filtering/sorting operations.

  2. Consistency: All timestamp fields in Actor now have descriptions - excellent improvement!

  3. Helper function: The reschedule_backoff helper (line 813-815) is well-placed and improves code reusability.

Recommendations Summary

  1. High Priority: Clarify/fix the timing of reschedule_ts calculation vs retry_count increment
  2. Medium Priority: Add test coverage for the new field
  3. Low Priority: Document the intentional change to start_ts behavior if that was intended

Overall, this is a solid PR that improves observability into actor rescheduling. The main concern is the potential off-by-one issue in the reschedule timestamp calculation.

@claude

claude bot commented Nov 5, 2025

Code Review: feat(pb): add reschedule_ts to actors

Overview

This PR adds a reschedule_ts field to track when an actor will attempt allocation again. The implementation includes proper serialization defaults, API documentation, and timestamp management.

Positive Observations

  1. Backward Compatibility: Using #[serde(default)] on the new reschedule_ts field ensures graceful handling of existing state without breaking deserialization. This is crucial for workflow state management.

  2. Comprehensive Documentation: Added clear, consistent documentation for all timestamp fields in the Actor struct and OpenAPI schema, improving API discoverability.

  3. Proper Timestamp Management: The reschedule_ts is correctly set in compare_retry (runtime.rs:734-736) and cleared on reset (runtime.rs:732), maintaining accurate state.

  4. Code Extraction: The reschedule_backoff helper function (runtime.rs:813-815) eliminates duplication and improves maintainability.

  5. Idempotency Fix: The change in set_started (runtime.rs:750-752) to only set start_ts once is a good defensive improvement, though unrelated to the main feature.

Issues and Concerns

1. Backoff Calculation Refactor (no bug found) ✓

Location: engine/packages/util/src/backoff.rs:88

At first glance the refactoring looks like it could change the backoff calculation:

// Before:
let next_wait = self.wait * 2usize.pow(self.i.min(self.max_exponent) as u32)
    + rand::thread_rng().gen_range(0..self.randomness);

// After:
let next_wait = self.current_duration() + rand::thread_rng().gen_range(0..self.randomness);

On closer inspection this is not a bug: current_duration() returns self.wait * 2usize.pow(self.i.min(self.max_exponent) as u32), so the formula is identical and both tick() and step() now use the same calculation pattern. ✓

2. Missing State Update Test

The new reschedule_ts field affects actor state transitions, but there's no explicit test coverage for:

  • Verifying reschedule_ts is set correctly based on backoff
  • Confirming reschedule_ts is cleared on retry reset
  • Checking reschedule_ts exposure through the API

Recommendation: Add integration tests covering the reschedule timestamp lifecycle.

3. Timestamp Calculation Precision

Location: engine/packages/pegboard/src/workflows/actor/runtime.rs:735

state.reschedule_ts = Some(now + i64::try_from(backoff.current_duration())?);

The calculation adds milliseconds from current_duration() (a usize) to an i64 timestamp. The try_from could theoretically fail if the duration ever exceeded i64::MAX (effectively impossible in practice), but if it did, the error would be opaque.

Minor suggestion: Consider adding context to the conversion error, though this is very low priority given the deployment environment.
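
Assuming the activity already returns an anyhow::Result (as the error handling noted elsewhere in this thread suggests), adding context could look like this sketch (names mirror the snippet above; not the PR's code):

use anyhow::Context;

let duration_ms = i64::try_from(backoff.current_duration())
    .context("backoff duration does not fit in i64 when computing reschedule_ts")?;
state.reschedule_ts = Some(now + duration_ms);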

4. API Consistency

The reschedule_ts field is now exposed through:

  • pegboard_actor_get (runtime.rs:687)
  • pegboard_actor_list_for_ns (runtime.rs:200)
  • OpenAPI schema

This is good, but verify that consumers understand this field may be set even when pending_allocation_ts is null (during backoff periods).

Documentation suggestion: Add a note in the OpenAPI description clarifying the relationship between reschedule_ts and pending_allocation_ts.

Minor Observations

  1. Field Ordering: The timestamp fields in Actor are now better organized chronologically, which improves readability.

  2. Activity Input Evolution: Adding retry_count to CompareRetryInput with #[serde(default)] maintains backward compatibility with existing activity invocations.

Performance and Security

  • ✅ No performance concerns introduced
  • ✅ No security issues identified
  • ✅ Database operations remain properly transactional
  • ✅ No new error paths that could leak sensitive information

Recommendations

  1. Add Tests: Include integration tests that verify:

    • reschedule_ts calculation matches backoff expectations
    • State transitions properly set/clear the timestamp
    • API responses correctly include the field
  2. API Documentation Enhancement: Clarify the relationship between reschedule_ts and actor allocation state in the OpenAPI description.

  3. Consider Logging: Add a debug log when setting reschedule_ts to aid in production debugging:

    state.reschedule_ts = Some(now + i64::try_from(backoff.current_duration())?);
    tracing::debug!(reschedule_ts = state.reschedule_ts, retry_count = input.retry_count, "set actor reschedule timestamp");

Verdict

This is a solid implementation that properly extends actor state tracking with good backward compatibility practices. The main gaps are in test coverage and API documentation clarity. The code changes are well-structured and follow the project's patterns.

Approval Status: ✅ Approve with recommendations for follow-up improvements.


🤖 Generated with Claude Code

@claude

claude bot commented Nov 5, 2025

Pull Request Review: Add reschedule_ts to actors

Overview

This PR adds a reschedule_ts field to track when an actor will attempt to allocate again after a failure, improving observability and providing better control over retry timing.


Positive Aspects

  1. Good Documentation: The OpenAPI schema and type definitions include clear descriptions of the new field's purpose
  2. Backward Compatibility: Using #[serde(default)] ensures existing serialized state can be deserialized without errors
  3. Consistent Field Ordering: The new field is placed logically with other timestamp fields in the Actor struct
  4. Code Refactoring: Extracting reschedule_backoff() into a helper function (lines 813-815 in runtime.rs) improves code reusability

Issues & Concerns

1. Logic Bug in compare_retry activity ⚠️

Location: engine/packages/pegboard/src/workflows/actor/runtime.rs:734-735

let backoff = reschedule_backoff(input.retry_count);
state.reschedule_ts = Some(now + i64::try_from(backoff.current_duration())?);

Problem: The reschedule_ts is being set to now + current_duration(), but current_duration() returns the duration for the current retry attempt (at retry_count), not the next one. This appears to be a timing mismatch.

Context: Looking at the flow in reschedule_actor (lines 622-641):

  1. compare_retry is called with the current retry_count
  2. Then retry_count is incremented (line 632-636)
  3. The backoff sleep uses backoff.step() which advances to the next duration (line 641)

Issue: The reschedule_ts is being calculated based on the current retry count's duration, but by the time it's used, the retry count has already been incremented. This means reschedule_ts reflects the previous attempt's duration, not the upcoming one.

Suggested Fix: Either:

  • Calculate reschedule_ts after incrementing retry_count, or
  • Use backoff.step() or manually calculate the next duration: wait * 2^((retry_count + 1).min(max_exponent))
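
A sketch of the second option, assuming reschedule_backoff(n) yields a backoff whose current_duration() corresponds to attempt n (names mirror the snippets quoted in this review; illustrative only, not a verified fix):

// Compute the timestamp for the upcoming attempt rather than the one that just failed.
let backoff = reschedule_backoff(input.retry_count + 1);
state.reschedule_ts = Some(now + i64::try_from(backoff.current_duration())?);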

2. Inconsistency in backoff.rs change 🤔

Location: engine/packages/util/src/backoff.rs:72-73, 88

The change consolidates the backoff calculation to use current_duration():

// Before
let next_wait = self.wait * 2usize.pow(self.i.min(self.max_exponent) as u32)
    + rand::thread_rng().gen_range(0..self.randomness);

// After  
let next_wait = self.current_duration() + rand::thread_rng().gen_range(0..self.randomness);

Question: Was this change intentional as a cleanup/refactoring, or was it necessary for the feature? The commit doesn't mention this change specifically. While it's a good refactoring (removes duplication), it changes both tick() and step() methods simultaneously.

Recommendation: If this is purely refactoring, consider mentioning it in the commit message. The change is good, but unexpected in a PR focused on adding reschedule_ts.


3. start_ts Behavior Change ⚠️

Location: engine/packages/pegboard/src/workflows/actor/runtime.rs:750-752

if state.start_ts.is_none() {
    state.start_ts = Some(util::timestamp::now());
}

Issue: This changes the behavior of set_started to only set start_ts on the first call. Previously, it would update on every call.

Questions:

  • Is this intentional?
  • Does this relate to the reschedule_ts feature or is it a separate bug fix?
  • Should this be in a separate commit/PR for clarity?

Documentation: The comment says "Denotes when the actor was first made connectable", but the code previously allowed updating this. The new behavior matches the documentation better, but it's unclear if this behavior change is intentional or a side effect.


4. Missing Test Coverage 🧪

The PR doesn't include tests for the new functionality. Consider adding tests for:

  • reschedule_ts is correctly set and updated during retry cycles
  • reschedule_ts is cleared when retry count is reset
  • The backoff calculation produces expected timestamps
  • Edge cases (e.g., reset vs. increment scenarios)

Per CLAUDE.md guidelines: "When running tests, always pipe the test to a file in /tmp/ then grep it in a second step."


5. Minor: Inconsistent timestamp documentation 📝

The timestamp field descriptions in actors.rs are good, but some could be more precise:

  • connectable_ts: "Null if actor is not running" - could be "Null if actor is not currently running/connectable"
  • start_ts: "Null if never" - could be "Null if actor was never started"

These are nitpicks and the current documentation is acceptable.


Performance & Security

Performance: No concerns. The field is efficiently stored and accessed.
Security: No security concerns identified.


Recommendations

Priority 1 - Must Fix:

  1. Review and fix the reschedule_ts calculation logic in compare_retry to ensure it reflects the correct next retry time
  2. Verify the start_ts behavior change is intentional and document it

Priority 2 - Should Address:
3. Add test coverage for the new reschedule_ts field and retry logic
4. Clarify if the backoff.rs change is intentional refactoring

Priority 3 - Nice to Have:
5. Consider more precise timestamp documentation


Summary

The PR adds valuable observability to the actor rescheduling system, but there's a potential timing bug in how reschedule_ts is calculated that should be addressed before merging. The code structure and documentation are generally good, but test coverage would strengthen confidence in the implementation.

@graphite-app
Contributor

graphite-app bot commented Nov 5, 2025

Merge activity

  • Nov 5, 5:07 AM UTC: NathanFlurry added this pull request to the Graphite merge queue.
  • Nov 5, 5:08 AM UTC: CI is running for this pull request on a draft pull request (#3371) due to your merge queue CI optimization settings.
  • Nov 5, 5:11 AM UTC: Merged by the Graphite merge queue via draft PR: #3371.

graphite-app bot pushed a commit that referenced this pull request Nov 5, 2025
@graphite-app graphite-app bot closed this Nov 5, 2025
@graphite-app graphite-app bot deleted the 11-04-feat_pb_add_reschedule_ts_to_actors branch November 5, 2025 05:11