Retry and replay for agent runs #336
Pull request overview
This pull request implements a comprehensive retry and replay mechanism for agent runs to recover from terminal failures (e.g., spot instance evictions, container deletions) and timeouts. The feature automatically retries failed tasks up to a configurable limit, replaying previously executed actions to minimize lost progress.
Changes:
- Added a `--max-retries` CLI argument (default: 3) to configure retry attempts with automatic trajectory replay
- Enhanced exception handling to attach environment state to `UnrecoverableTerminalError` for better recovery
- Improved progress tracking to use set-based completion tracking, preventing duplicate counts during retries
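The retry-with-replay flow described above can be sketched as follows. This is an illustrative outline only: `run_task`, `run_with_retries`, and the exception wiring here are hypothetical stand-ins for the actual `scripts/run.py` code, though `UnrecoverableTerminalError` and the `--max-retries` semantics mirror the PR description.

```python
# Hypothetical sketch of the retry loop: on a terminal/timeout failure the
# saved trajectory is reloaded and replayed on the next attempt, so progress
# made before a spot eviction or container deletion is not lost.
class UnrecoverableTerminalError(Exception):
    pass

class AgentTimeoutException(Exception):
    pass

def run_with_retries(run_task, load_trajectory, max_retries=3):
    replay_actions = []
    for attempt in range(max_retries + 1):
        try:
            # Replay previously executed actions, then continue normally.
            return run_task(replay_actions=replay_actions)
        except (UnrecoverableTerminalError, AgentTimeoutException):
            if attempt == max_retries:
                raise  # retries exhausted; surface the failure
            # Reload the saved trajectory so the next attempt can replay it.
            replay_actions = load_trajectory()
```

A run that fails on its first attempts would be retried with the reloaded actions, and only after `max_retries` failed retries does the exception propagate.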
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `scripts/run.py` | Implements the retry loop with trajectory saving/loading and action replay on terminal/timeout failures |
| `debug_gym/agents/base_agent.py` | Adds a `replay_actions` parameter to the `run()` method and updates `execute_action()` to record failed steps in history |
| `debug_gym/agents/utils.py` | Adds a `load_trajectory()` function to reconstruct `LLMResponse` objects from saved trajectories, and the `--max-retries` CLI argument |
| `debug_gym/logger.py` | Refactors completion tracking from a counter to a set to prevent duplicate counting during retries |
| `debug_gym/gym/terminals/terminal.py` | Enhances `UnrecoverableTerminalError` to accept and store `env_info` for replay purposes |
| `debug_gym/gym/envs/env.py` | Attaches `env_info` to `UnrecoverableTerminalError` before re-raising; removes an unnecessary f-string |
| `tests/agents/test_utils.py` | Adds comprehensive tests for `load_trajectory()` covering various edge cases |
| `tests/agents/test_base_agent.py` | Adds tests for `execute_action()` exception handling and replay functionality |
| `tests/gym/envs/test_unrecoverable_terminal.py` | Updates the test to verify the exception now includes `env_info` |
Comments suppressed due to low confidence (1)
scripts/run.py:179
- The code unconditionally references `agent` and `env` on lines 171 and 176, but if the retry loop exits due to a KeyboardInterrupt (line 157), or if all retries fail before `agent`/`env` are created, these variables may be undefined, resulting in a `NameError`. Initialize both variables to `None` before the loop and add existence checks before using them.
```python
# save trajectory
save_trajectory(agent, task_path, task_logger)
# optionally apply patch
if config.get("save_patch", True):
    try:
        save_patch(env, task_path, task_logger)
    except Exception as patch_error:
        # Terminal may be unavailable (e.g., pod died), log and continue
        task_logger.warning(f"Could not save patch: {patch_error!r}")
```
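The guard Copilot suggests can be sketched like this. The function shape and parameter names below are hypothetical (the actual `scripts/run.py` structures this differently); the point is the pattern: initialize the variables before anything that can fail, and check them in the cleanup path.

```python
# Hypothetical sketch of the suggested fix: agent and env are initialized up
# front, so the cleanup block can never hit a NameError even when creation
# fails on the very first attempt (e.g., a spot eviction mid-setup).
def run_and_save(create_agent_and_env, save_trajectory, save_patch):
    agent = env = None  # defined before anything that can raise
    try:
        agent, env = create_agent_and_env()  # may itself fail
        return agent
    finally:
        # Existence checks replace the unconditional references.
        if agent is not None:
            save_trajectory(agent)
        if env is not None:
            try:
                save_patch(env)
            except Exception as patch_error:
                # Terminal may be gone; log and continue
                print(f"Could not save patch: {patch_error!r}")
```

If `create_agent_and_env` raises, the `finally` block runs with both names still bound to `None` and simply skips the saves instead of crashing with a `NameError`.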
MarcCote
left a comment
Left a few comments. Overall LGTM.
This pull request implements retry and replay functionality for agent runs in the event of terminal or timeout failures. It introduces a mechanism to automatically retry failed tasks up to a configurable limit, replaying previous actions to ensure continuity and minimize lost progress. The changes also improve error handling, progress tracking, and trajectory management throughout the agent execution lifecycle.
The most important changes are:
Retry and Replay Mechanism:
- Added a `--max-retries` argument to allow configuration of the maximum number of retries for terminal or timeout failures, with replay of previous steps on retry (`debug_gym/agents/utils.py`, `scripts/run.py`). [1] [2]
- Updated `scripts/run.py` to catch `UnrecoverableTerminalError` and `AgentTimeoutException`, save the current trajectory, reload actions from previous attempts, and replay them before generating new actions on retry. [1] [2]
- Modified `BaseAgent.run` to accept a `replay_actions` argument, replaying previous LLM responses before proceeding with new steps, and logging replay details for traceability (`debug_gym/agents/base_agent.py`). [1] [2]

Trajectory Management:
- Added a `load_trajectory` utility to reconstruct `LLMResponse` objects from saved trajectories, enabling accurate action replay (`debug_gym/agents/utils.py`).
- Saves trajectories during the retry loop so they can be reloaded and replayed (`scripts/run.py`).

Error Handling Improvements:
- Attached `env_info` to exceptions, enabling better recovery and replay (`debug_gym/gym/envs/env.py`, `debug_gym/gym/terminals/terminal.py`, `debug_gym/agents/base_agent.py`). [1] [2] [3]
- Refined exception handling in `BaseAgent.run` and `scripts/run.py` to avoid duplicate error reporting and ensure accurate progress logging. [1] [2] [3] [4] [5]

Progress Tracking:
- Updated `DebugGymLogger` to track completed tasks using a set of completed task IDs, ensuring correct progress reporting in the presence of retries (`debug_gym/logger.py`). [1] [2] [3]

Minor Fixes and Refactoring:
- Small cleanups in `debug_gym/agents/base_agent.py`, `debug_gym/agents/utils.py`, `debug_gym/gym/envs/env.py`, and `tests/agents/test_utils.py`. [1] [2] [3] [4]
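The counter-to-set refactor for progress tracking can be illustrated with a small sketch. The class and method names below are hypothetical, not the actual `DebugGymLogger` API; the point is that adding a task ID to a set is idempotent, so a task that completes again after a retry is not counted twice.

```python
# Hypothetical sketch of set-based completion tracking. With a raw counter,
# a retried task that finishes a second time would increment the count again;
# with a set of task IDs, re-adding the same ID is a no-op.
class ProgressTracker:
    def __init__(self, total_tasks):
        self.total_tasks = total_tasks
        self._completed = set()  # completed task IDs, not a counter

    def mark_done(self, task_id):
        self._completed.add(task_id)  # duplicate completions collapse

    @property
    def completed(self):
        return len(self._completed)

    def report(self):
        return f"{self.completed}/{self.total_tasks} tasks completed"
```

Marking `"task-1"` done twice (once before a retry, once after) still reports one completion.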