Improve the extraction agent reliability, observability, and retry behavior by Xreki · Pull Request #709 · PaddlePaddle/GraphNet

Xreki · 2026-05-15T04:14:44Z

PR Category

Other

Description

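- Add structured error classification and emit error_category in batch extraction results.
- Refine the LLM retry strategy so that only script execution failures trigger a fix-and-retry.
- Avoid pointless LLM retries for unfixable errors such as timeouts and unreachable models.
- Check model repository reachability through the HuggingFace API before downloading.
- Strengthen timeout cleanup for extraction subprocesses and cleanup of orphaned processes to reduce GPU resource leaks.
- Add separate statistics and log output for verification timeouts.
- Restore the default values of the CPU verify timeout and the LLM fix timeout.
- Constrain the LLM to generate leaner run_model.py fix scripts.
- Convert the Chinese code comments under graph_net/agent to English.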

paddle-bot · 2026-05-15T04:15:25Z

Thanks for your contribution!

- LLMCodeFixer: support Optional[int] timeout, default 360s when None
- GraphNetAgent: add llm_timeout parameter (default: 600s)
- Remove download_timeout from previous iteration

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
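The Optional[int] timeout handling described above can be illustrated with a minimal sketch. The helper name `resolve_timeout` and the constant name are hypothetical and do not come from the PR; only the 360-second fallback is taken from the commit message.

```python
from typing import Optional

# Fallback named in the commit message; the constant name is an assumption.
DEFAULT_LLM_FIX_TIMEOUT_S = 360


def resolve_timeout(timeout: Optional[int]) -> int:
    """Return the caller's timeout, or the 360 s default when it is None."""
    if timeout is None:
        return DEFAULT_LLM_FIX_TIMEOUT_S
    if timeout <= 0:
        # Rejecting non-positive values is a choice made for this sketch.
        raise ValueError("timeout must be a positive number of seconds")
    return timeout
```

Resolving None in one place keeps every call site free of its own default, so a later change to the default touches a single constant.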

- ForwardVerifier now records last_timeout_success when eager forward passes are skipped due to subprocess timeout.
- GraphNetAgent propagates this flag via last_timeout_success attribute.
- parallel_extract worker reports timeout_success per model.
- PROGRESS line format: success=xx%(timeout_success=xx)%
- Summary and per-GPU stats also include timeout counts/rates.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
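As a rough illustration of the timeout_success accounting, the sketch below counts a model as timeout_success when it succeeded only because forward verification timed out. The class name `ExtractStats`, its fields, and the exact text of the progress line are assumptions; the real format in parallel_extract may differ.

```python
from dataclasses import dataclass


@dataclass
class ExtractStats:
    """Hypothetical per-run counters; not the PR's actual data structure."""

    total: int = 0
    success: int = 0
    timeout_success: int = 0  # successes where forward verify timed out

    def record(self, ok: bool, timed_out: bool = False) -> None:
        self.total += 1
        if ok:
            self.success += 1
            if timed_out:
                self.timeout_success += 1

    def _pct(self, count: int) -> float:
        # Both rates use the number of processed models as the denominator.
        return 100.0 * count / self.total if self.total else 0.0

    def progress_line(self) -> str:
        return (
            f"PROGRESS done={self.total} "
            f"success={self._pct(self.success):.1f}%"
            f"(timeout_success={self._pct(self.timeout_success):.1f}%)"
        )
```

Reporting timeout_success separately keeps the headline success rate honest: a reader can see how many passes were never actually verified.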

- Add ProcessGroupTracker class to track active child process groups spawned by SubprocessGraphExtractor, enabling bulk kill via SIGKILL.
- Add orphan watcher daemon thread in parallel_extract worker_fn: detects when parent dies (ppid == 1) and kills all tracked child process groups, then exits via os._exit(1) to avoid Python-level cleanup delays that could block GPU memory release.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
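The tracker and orphan watcher described above could look roughly like the following POSIX-only sketch. The class name comes from the commit message, but the method names (`add`, `discard`, `kill_all`), the helper `start_orphan_watcher`, and the polling interval are assumptions, not the PR's actual code.

```python
import os
import signal
import threading
import time


class ProcessGroupTracker:
    """Thread-safe registry of child process group IDs (sketch only)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._pgids = set()

    def add(self, pgid: int) -> None:
        with self._lock:
            self._pgids.add(pgid)

    def discard(self, pgid: int) -> None:
        with self._lock:
            self._pgids.discard(pgid)

    def kill_all(self) -> int:
        """SIGKILL every tracked group; return how many signals were sent."""
        with self._lock:
            pgids = list(self._pgids)
            self._pgids.clear()
        killed = 0
        for pgid in pgids:
            try:
                os.killpg(pgid, signal.SIGKILL)
                killed += 1
            except ProcessLookupError:
                pass  # the group already exited
        return killed


def start_orphan_watcher(tracker: ProcessGroupTracker,
                         poll_interval: float = 1.0) -> threading.Thread:
    """Start a daemon thread that cleans up once the parent has died."""

    def watch() -> None:
        while True:
            if os.getppid() == 1:  # reparented to init: the parent is gone
                tracker.kill_all()
                os._exit(1)  # skip Python-level cleanup, as the commit describes
            time.sleep(poll_interval)

    thread = threading.Thread(target=watch, daemon=True, name="orphan-watcher")
    thread.start()
    return thread
```

Children must be started in their own process group (for example with `subprocess.Popen(..., start_new_session=True)`) for `os.killpg` to reach their descendants. The `ppid == 1` test assumes orphans are reparented to PID 1; under a subreaper, or when the worker is itself launched directly by PID 1 in a container, the check would need adjusting.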

- Rename exceptions to match subdirectory names:
  AnalysisError -> MetadataAnalysisError
  CodeGenError -> CodeGenerationError
  ExtractionError -> GraphExtractionError
  VerificationError -> SampleVerificationError
- Add error_category to exceptions with default_category class attrs
- Categorize errors at throw sites (404/403, config missing, script timeout, output missing, forward verify failed, etc.)
- Introduce GraphExtractionErrorClassifier for type-safe classification
- Smart LLM retry: only retry SCRIPT_EXECUTION_FAILED; skip retry for timeouts, model_not_found, model_forbidden, and LLM infra errors

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Move GraphExtractionErrorCategory from error_classifier.py to exceptions.py so type definitions live with the data they describe.
- Change default_category and error_category from raw strings to GraphExtractionErrorCategory enum values.
- Add missing categories: CONFIG_NOT_FOUND, CONFIG_PARSE_ERROR, METADATA_ANALYSIS_FAILED, VERIFICATION_FAILED.
- Update all raise-sites to pass enum members instead of strings.
- Remove redundant inline import in _is_llm_fixable_error.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
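A minimal sketch of exceptions that carry an enum-valued category is shown below. The enum and exception class names and the category names CONFIG_NOT_FOUND, CONFIG_PARSE_ERROR, METADATA_ANALYSIS_FAILED, VERIFICATION_FAILED, and SCRIPT_EXECUTION_FAILED come from the commit messages. The string values, the UNKNOWN member, the base class `GraphNetAgentError`, and the mapping of each exception to its default are assumptions made for illustration.

```python
from enum import Enum
from typing import Optional


class GraphExtractionErrorCategory(Enum):
    # Member values are illustrative; the PR does not show them.
    CONFIG_NOT_FOUND = "config_not_found"
    CONFIG_PARSE_ERROR = "config_parse_error"
    METADATA_ANALYSIS_FAILED = "metadata_analysis_failed"
    SCRIPT_EXECUTION_FAILED = "script_execution_failed"
    VERIFICATION_FAILED = "verification_failed"
    UNKNOWN = "unknown"  # hypothetical catch-all


class GraphNetAgentError(Exception):
    """Hypothetical common base holding the category logic."""

    default_category = GraphExtractionErrorCategory.UNKNOWN

    def __init__(self, message: str,
                 error_category: Optional[GraphExtractionErrorCategory] = None):
        super().__init__(message)
        # A raise site may pass a specific category; otherwise fall back
        # to the class-level default.
        self.error_category = (
            self.default_category if error_category is None else error_category
        )


class MetadataAnalysisError(GraphNetAgentError):
    default_category = GraphExtractionErrorCategory.METADATA_ANALYSIS_FAILED


class GraphExtractionError(GraphNetAgentError):
    default_category = GraphExtractionErrorCategory.SCRIPT_EXECUTION_FAILED


class SampleVerificationError(GraphNetAgentError):
    default_category = GraphExtractionErrorCategory.VERIFICATION_FAILED
```

Using enum members instead of raw strings means a misspelled category fails at import or raise time with an AttributeError rather than silently producing a new, unmatched string.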

- Rename markdown_report() to report_lines() and return List[str] instead of a markdown-formatted string.
- Remove markdown syntax (#, |, ---, etc.) for simpler consumption.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Include ModelFetchError in explicit except clause so it returns EXTRACT_FAILED with proper classification instead of ERROR.
- Worker reads agent.error_classifier.get_record(model_id) after extract_sample() and forwards error_category + error_message in result_dict so the main process can see which stage failed.
- Fallback: if extract_sample itself raises unexpectedly, read error_category attribute from the raw exception.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
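The worker-side forwarding could be sketched as follows. The names `extract_sample`, `error_classifier.get_record`, `result_dict`, `error_category`, `error_message`, EXTRACT_FAILED, and ERROR appear in the commit message. Everything else is an assumption: the helper name `run_one`, the "SUCCESS" status string, `extract_sample` returning a boolean, and the record being a dictionary.

```python
from typing import Any, Dict


def run_one(agent: Any, model_id: str) -> Dict[str, Any]:
    """Run one extraction and build the result dict sent to the main process."""
    result_dict: Dict[str, Any] = {
        "model_id": model_id,
        "status": "ERROR",
        "error_category": None,
        "error_message": None,
    }
    try:
        ok = agent.extract_sample(model_id)  # assumed to return True/False
    except Exception as exc:
        # Fallback path: the agent raised unexpectedly, so read the
        # category straight off the exception if it carries one.
        result_dict["error_category"] = getattr(exc, "error_category", None)
        result_dict["error_message"] = str(exc)
        return result_dict

    if ok:
        result_dict["status"] = "SUCCESS"
        return result_dict

    result_dict["status"] = "EXTRACT_FAILED"
    record = agent.error_classifier.get_record(model_id)
    if record is not None:  # record shape (a dict) is an assumption
        result_dict["error_category"] = record.get("error_category")
        result_dict["error_message"] = record.get("error_message")
    return result_dict
```

Forwarding the category in the result keeps the main process independent of the agent object, which lives only inside the worker.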

Increase CPU verify_timeout default from 600s to 1200s.

88f4a6b

- CPU forward verification often takes 1000s+ for large models
- Update --verify-timeout help text accordingly

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Xreki and others added 17 commits May 15, 2026 12:15

Increase LLM timeout and skip forward verify on CPU timeout.

e8c26a1

- Raise default llm_timeout from 600s to 900s to reduce ducc -p timeout failures.
- Treat forward verification timeout as pass for large models on CPU.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Merge branch 'develop' into opt_extract_agent

1042f00

Improve prompt.

69d3826

fix(agent): change llm_timeout default back to 600

9ba299f

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

docs(agent): translate Chinese comments to English

ca59ba9

Keep source comments and docstrings in English while preserving existing LLM prompt text.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix(agent): skip non-fixable LLM retry errors

99d1089

Stop retrying LLM script fixes when a repaired script fails with a category that cannot be addressed by rewriting the script.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
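The retry rule can be sketched as a loop that re-checks the failure category after every repaired run. This is an illustration under assumptions rather than the PR's code: the `Category` enum is a trimmed stand-in, and `run_with_llm_retry`, the `run` and `fix` callables, and `max_retries` are hypothetical names.

```python
from enum import Enum
from typing import Callable, Optional, Tuple


class Category(Enum):
    """Trimmed stand-in for the PR's error category enum."""

    SCRIPT_EXECUTION_FAILED = "script_execution_failed"
    SCRIPT_TIMEOUT = "script_timeout"
    MODEL_NOT_FOUND = "model_not_found"


# Only failures that rewriting the script could plausibly address.
LLM_FIXABLE = frozenset({Category.SCRIPT_EXECUTION_FAILED})


def run_with_llm_retry(
    script: str,
    run: Callable[[str], Optional[Category]],  # returns None on success
    fix: Callable[[str, Category], str],       # returns a repaired script
    max_retries: int = 3,
) -> Tuple[Optional[Category], int]:
    """Return (final category or None, number of LLM fix attempts used)."""
    category = run(script)
    attempts = 0
    # Re-evaluate the category after each repaired run so that a fix which
    # turns a crash into a timeout stops the loop immediately.
    while category in LLM_FIXABLE and attempts < max_retries:
        script = fix(script, category)
        attempts += 1
        category = run(script)
    return category, attempts
```

Checking the category inside the loop, not just before it, is what prevents spending further LLM calls once a repaired script fails for a reason the LLM cannot fix.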

fix(agent): restore CPU verify timeout default

2722a94

Set CPU forward verification timeout back to 600 seconds and keep CLI help text in sync.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

feat(agent): precheck HuggingFace model accessibility

d93dfdb

Use HuggingFace API metadata lookup before download to skip clearly missing or forbidden model repos while preserving download retries for transient failures.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
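One way to express this precheck is sketched below. It assumes the lookup goes through `HfApi().model_info(model_id)` from huggingface_hub and that HTTP errors expose `response.status_code`. The function name `precheck_model`, the injectable `lookup` parameter, and the returned category strings are hypothetical.

```python
from typing import Callable, Optional


def _default_lookup(model_id: str) -> None:
    # Imported lazily so the sketch can be exercised without the package.
    from huggingface_hub import HfApi

    HfApi().model_info(model_id)


def precheck_model(
    model_id: str,
    lookup: Callable[[str], object] = _default_lookup,
) -> Optional[str]:
    """Return a skip reason, or None when the download should proceed."""
    try:
        lookup(model_id)
    except Exception as exc:
        response = getattr(exc, "response", None)
        status = getattr(response, "status_code", None)
        if status == 404:
            return "model_not_found"
        if status == 403:
            return "model_forbidden"
        # Anything else (network errors, 5xx, rate limits) is treated as
        # transient here, so the normal download retry path still runs.
        return None
    return None
```

Only a definite 404 or 403 short-circuits the download; ambiguous failures fall through so that a flaky metadata call cannot cause a reachable model to be skipped.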

refactor(agent): constrain LLM fixer output

95e4163

Ask the LLM script fixer to emit a minimal complete run_model.py without comments, fallback logic, helpers, or unrelated validation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix(agent): restore LLM timeout default

6744a79

Set the default LLM script-fix timeout back to 360 seconds and update the parameter documentation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Xreki changed the title ~~Increase CPU verify_timeout default from 600s to 1200s.~~ Improve the extraction agent reliability, observability, and retry behavior for large-scale HuggingFace model extraction May 19, 2026

Xreki changed the title ~~Improve the extraction agent reliability, observability, and retry behavior for large-scale HuggingFace model extraction~~ Improve the extraction agent reliability, observability, and retry behavior May 19, 2026

Xreki and others added 2 commits May 19, 2026 13:45

fix(agent): support current HuggingFace model_info API

307ca47

Remove the unsupported repo_type argument from the model accessibility precheck so it works with the installed huggingface_hub version.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

refactor(agent): generate minimal extraction scripts

25a0746

Simplify generated run_model.py files and structure inputs as a dictionary literal to reduce retry prompt size.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

luotao1 approved these changes May 19, 2026

Xreki merged commit bc0a831 into PaddlePaddle:develop May 19, 2026
3 checks passed

Xreki deleted the opt_extract_agent branch May 19, 2026 07:31

Xreki merged 20 commits into PaddlePaddle:develop from Xreki:opt_extract_agent

2 participants
