Improve the extraction agent reliability, observability, and retry behavior #709
Merged
- CPU forward verification often takes 1000s+ for large models
- Update --verify-timeout help text accordingly

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Thanks for your contribution!
- LLMCodeFixer: support Optional[int] timeout, default 360s when None
- GraphNetAgent: add llm_timeout parameter (default: 600s)
- Remove download_timeout from previous iteration

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
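The `Optional[int]` timeout handling described in this commit can be illustrated with a minimal sketch. The function name `resolve_timeout` and the constant are hypothetical; only the 360s fallback for `None` comes from the commit message.

```python
from typing import Optional

# Fallback named in the commit message; the constant name is an assumption.
DEFAULT_LLM_FIX_TIMEOUT_S = 360


def resolve_timeout(timeout: Optional[int]) -> int:
    """Return the effective timeout in seconds, falling back when None."""
    if timeout is None:
        return DEFAULT_LLM_FIX_TIMEOUT_S
    if timeout <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    return timeout
```

Rejecting non-positive values is a choice made for this sketch, not something the commit states.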
- Raise default llm_timeout from 600s to 900s to reduce ducc -p timeout failures.
- Treat forward verification timeout as pass for large models on CPU.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- ForwardVerifier now records last_timeout_success when eager forward passes are skipped due to subprocess timeout.
- GraphNetAgent propagates this flag via last_timeout_success attribute.
- parallel_extract worker reports timeout_success per model.
- PROGRESS line format: success=xx%(timeout_success=xx)%
- Summary and per-GPU stats also include timeout counts/rates.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
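A rough sketch of the timeout-as-pass behavior with a recorded flag, assuming the forward pass runs as a subprocess command. `ForwardVerifierSketch` and its `verify` signature are hypothetical stand-ins; only the `last_timeout_success` attribute name is taken from the commit message.

```python
import subprocess
from typing import List


class ForwardVerifierSketch:
    """Hypothetical stand-in for ForwardVerifier; not the PR's actual code."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self.last_timeout_success = False

    def verify(self, cmd: List[str]) -> bool:
        """Run the forward-pass command; a timeout counts as a flagged pass."""
        self.last_timeout_success = False
        try:
            proc = subprocess.run(cmd, timeout=self.timeout_s, capture_output=True)
        except subprocess.TimeoutExpired:
            # The child is killed by subprocess.run; report success but
            # record the flag so callers can count these cases separately.
            self.last_timeout_success = True
            return True
        return proc.returncode == 0
```

Keeping the flag separate from the boolean result lets a caller report ordinary passes and timeout passes as distinct counts.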
- Add ProcessGroupTracker class to track active child process groups spawned by SubprocessGraphExtractor, enabling bulk kill via SIGKILL.
- Add orphan watcher daemon thread in parallel_extract worker_fn: detects when parent dies (ppid == 1) and kills all tracked child process groups, then exits via os._exit(1) to avoid Python-level cleanup delays that could block GPU memory release.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
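The tracker and orphan watcher described above might look roughly like the following POSIX-only sketch. The method names (`add`, `kill_all`) and the helpers `spawn_tracked` and `start_orphan_watcher` are assumptions made for illustration; only the class name, the `ppid == 1` check, SIGKILL, and `os._exit(1)` come from the commit message.

```python
import os
import signal
import subprocess
import threading
import time
from typing import List, Set


class ProcessGroupTracker:
    """Thread-safe set of child process-group IDs (sketch, POSIX only)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._pgids: Set[int] = set()

    def add(self, pgid: int) -> None:
        with self._lock:
            self._pgids.add(pgid)

    def kill_all(self) -> int:
        """SIGKILL every tracked group; return how many signals were sent."""
        with self._lock:
            pgids = list(self._pgids)
            self._pgids.clear()
        killed = 0
        for pgid in pgids:
            try:
                os.killpg(pgid, signal.SIGKILL)
                killed += 1
            except ProcessLookupError:
                pass  # the group has already exited
        return killed


def spawn_tracked(tracker: ProcessGroupTracker, cmd: List[str]) -> subprocess.Popen:
    """Start cmd in its own session so its process group can be killed as a unit."""
    proc = subprocess.Popen(cmd, start_new_session=True)
    tracker.add(os.getpgid(proc.pid))
    return proc


def start_orphan_watcher(tracker: ProcessGroupTracker, poll_s: float = 1.0) -> threading.Thread:
    """Daemon thread: once re-parented to init (ppid == 1), kill children and exit."""

    def _watch() -> None:
        while True:
            if os.getppid() == 1:
                tracker.kill_all()
                os._exit(1)  # skip interpreter cleanup, as the commit describes
            time.sleep(poll_s)

    thread = threading.Thread(target=_watch, daemon=True)
    thread.start()
    return thread
```

Note that the `ppid == 1` test presumes the worker's parent is not itself PID 1, which may not hold inside some containers.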
- Rename exceptions to match subdirectory names:
  - AnalysisError -> MetadataAnalysisError
  - CodeGenError -> CodeGenerationError
  - ExtractionError -> GraphExtractionError
  - VerificationError -> SampleVerificationError
- Add error_category to exceptions with default_category class attrs
- Categorize errors at throw sites (404/403, config missing, script timeout, output missing, forward verify failed, etc.)
- Introduce GraphExtractionErrorClassifier for type-safe classification
- Smart LLM retry: only retry SCRIPT_EXECUTION_FAILED; skip retry for timeouts, model_not_found, model_forbidden, and LLM infra errors

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Move GraphExtractionErrorCategory from error_classifier.py to exceptions.py so type definitions live with the data they describe.
- Change default_category and error_category from raw strings to GraphExtractionErrorCategory enum values.
- Add missing categories: CONFIG_NOT_FOUND, CONFIG_PARSE_ERROR, METADATA_ANALYSIS_FAILED, VERIFICATION_FAILED.
- Update all raise-sites to pass enum members instead of strings.
- Remove redundant inline import in _is_llm_fixable_error.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
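As an illustration of enum-valued categories with a class-level default, here is a minimal sketch. The base class `CategorizedError`, the enum string values, the `SCRIPT_TIMEOUT` member name, and the choice of defaults are assumptions; the other member names, the exception names, and the `default_category` / `error_category` attributes follow the commit messages.

```python
from enum import Enum
from typing import Optional


class GraphExtractionErrorCategory(Enum):
    # String values are assumed; SCRIPT_TIMEOUT is an assumed member name.
    SCRIPT_EXECUTION_FAILED = "script_execution_failed"
    SCRIPT_TIMEOUT = "script_timeout"
    MODEL_NOT_FOUND = "model_not_found"
    MODEL_FORBIDDEN = "model_forbidden"
    CONFIG_NOT_FOUND = "config_not_found"
    VERIFICATION_FAILED = "verification_failed"


class CategorizedError(Exception):
    """Hypothetical base class; the PR may structure its hierarchy differently."""

    default_category = GraphExtractionErrorCategory.SCRIPT_EXECUTION_FAILED

    def __init__(
        self,
        message: str,
        error_category: Optional[GraphExtractionErrorCategory] = None,
    ) -> None:
        super().__init__(message)
        # Raise sites may pass a specific category; otherwise use the class default.
        self.error_category = (
            error_category if error_category is not None else self.default_category
        )


class GraphExtractionError(CategorizedError):
    default_category = GraphExtractionErrorCategory.SCRIPT_EXECUTION_FAILED


class SampleVerificationError(CategorizedError):
    default_category = GraphExtractionErrorCategory.VERIFICATION_FAILED


def is_llm_fixable(exc: Exception) -> bool:
    """Only script execution failures are worth sending back to the LLM fixer."""
    category = getattr(exc, "error_category", None)
    return category is GraphExtractionErrorCategory.SCRIPT_EXECUTION_FAILED
```

Comparing enum members with `is` avoids the typo risk of raw string comparison, which appears to be the motivation for moving away from strings.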
- Rename markdown_report() to report_lines() and return List[str] instead of a markdown-formatted string.
- Remove markdown syntax (#, |, ---, etc.) for simpler consumption.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Include ModelFetchError in explicit except clause so it returns EXTRACT_FAILED with proper classification instead of ERROR.
- Worker reads agent.error_classifier.get_record(model_id) after extract_sample() and forwards error_category + error_message in result_dict so the main process can see which stage failed.
- Fallback: if extract_sample itself raises unexpectedly, read error_category attribute from the raw exception.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
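The worker-side forwarding described above could be sketched as follows. The function name `run_one`, the `"SUCCESS"` status string, the assumption that `extract_sample()` returns a truthy value on success, and the shape of the record (attributes `error_category` and `error_message`) are all hypothetical; `get_record`, `EXTRACT_FAILED`, and `ERROR` come from the commit message.

```python
from typing import Any, Dict


def run_one(agent: Any, model_id: str) -> Dict[str, Any]:
    """Run one extraction and return a result dict for the main process."""
    result: Dict[str, Any] = {
        "model_id": model_id,
        "status": "SUCCESS",  # assumed status name
        "error_category": None,
        "error_message": None,
    }
    try:
        ok = agent.extract_sample(model_id)
    except Exception as exc:
        # Fallback path: extract_sample raised unexpectedly, so read the
        # category straight off the exception if it carries one.
        result["status"] = "ERROR"
        result["error_category"] = getattr(exc, "error_category", None)
        result["error_message"] = str(exc)
        return result
    if not ok:
        result["status"] = "EXTRACT_FAILED"
        record = agent.error_classifier.get_record(model_id)
        if record is not None:
            result["error_category"] = record.error_category
            result["error_message"] = record.error_message
    return result
```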
Keep source comments and docstrings in English while preserving existing LLM prompt text. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Stop retrying LLM script fixes when a repaired script fails with a category that cannot be addressed by rewriting the script. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
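One way to picture this stop condition is the loop below, a sketch under stated assumptions: `Category`, `run_with_fixes`, and the callables `run_script` (returns a failure category or `None` on success) and `fix_script` are hypothetical names, and `max_fixes=3` is an arbitrary illustrative limit.

```python
from enum import Enum
from typing import Callable, Optional, Tuple


class Category(Enum):
    # Reduced, hypothetical category set for this sketch.
    SCRIPT_EXECUTION_FAILED = "script_execution_failed"
    SCRIPT_TIMEOUT = "script_timeout"


# Categories that rewriting the script could plausibly address.
LLM_FIXABLE = frozenset({Category.SCRIPT_EXECUTION_FAILED})


def run_with_fixes(
    script: str,
    run_script: Callable[[str], Optional[Category]],
    fix_script: Callable[[str, Category], str],
    max_fixes: int = 3,
) -> Tuple[Optional[Category], int]:
    """Run the script, asking for a fix only while the failure is fixable."""
    failure = run_script(script)
    fixes = 0
    while failure is not None and failure in LLM_FIXABLE and fixes < max_fixes:
        script = fix_script(script, failure)
        fixes += 1
        # Re-check the category after every repaired run: a repaired script
        # that now times out ends the loop instead of triggering another fix.
        failure = run_script(script)
    return failure, fixes
```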
Set CPU forward verification timeout back to 600 seconds and keep CLI help text in sync. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use HuggingFace API metadata lookup before download to skip clearly missing or forbidden model repos while preserving download retries for transient failures. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
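The precheck decision can be sketched independently of any client library. Here `fetch_status` is a hypothetical callable that returns the HTTP status of a metadata lookup; in practice that lookup might be something like `huggingface_hub`'s `HfApi().model_info(repo_id)`, though which call the PR uses is an assumption. The 404/403 mapping follows the categories named in the earlier commit messages.

```python
from typing import Callable, Optional

SKIP_NOT_FOUND = "model_not_found"
SKIP_FORBIDDEN = "model_forbidden"


def precheck(repo_id: str, fetch_status: Callable[[str], int]) -> Optional[str]:
    """Return a skip reason, or None when the download should proceed."""
    try:
        status = fetch_status(repo_id)
    except OSError:
        # Network trouble is treated as transient: let the download
        # path and its retries decide, rather than skipping the model.
        return None
    if status == 404:
        return SKIP_NOT_FOUND
    if status == 403:
        return SKIP_FORBIDDEN
    return None
```

Only clearly terminal answers skip the model; any other status or a connection error falls through to the normal download with retries.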
Ask the LLM script fixer to emit a minimal complete run_model.py without comments, fallback logic, helpers, or unrelated validation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Set the default LLM script-fix timeout back to 360 seconds and update the parameter documentation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove the unsupported repo_type argument from the model accessibility precheck so it works with the installed huggingface_hub version. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Simplify generated run_model.py files and structure inputs as a dictionary literal to reduce retry prompt size. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
luotao1
approved these changes
May 19, 2026
PR Category
Other
Description
- error_category
- run_model.py fix script
- Change Chinese code comments under graph_net/agent to English