Improve the extraction agent reliability, observability, and retry behavior #709

Merged
Xreki merged 20 commits into PaddlePaddle:develop from Xreki:opt_extract_agent
May 19, 2026

@Xreki Xreki commented May 15, 2026

Collaborator

PR Category

Other

Description

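  • Add structured error classification and emit error_category in batch extraction results
  • Refine the LLM retry strategy so that repair retries run only for script execution failures
  • Avoid pointless LLM retries for unfixable errors such as timeouts and unreachable models
  • Check model repository reachability through the HuggingFace API before downloading
  • Strengthen timeout cleanup and orphan-process cleanup for extraction subprocesses to reduce GPU resource leaks
  • Add separate statistics and log output for verification timeouts
  • Restore the default values of the CPU verify timeout and the LLM fix timeout
  • Constrain the LLM to generate a leaner run_model.py fix script
  • Convert the Chinese code comments under graph_net/agent to English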

- CPU forward verification often takes 1000s+ for large models
- Update --verify-timeout help text accordingly

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@paddle-bot

paddle-bot Bot commented May 15, 2026

Thanks for your contribution!

Xreki and others added 17 commits May 15, 2026 12:15
- LLMCodeFixer: support Optional[int] timeout, default 360s when None
- GraphNetAgent: add llm_timeout parameter (default: 600s)
- Remove download_timeout from previous iteration

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Raise default llm_timeout from 600s to 900s to reduce ducc -p timeout failures.
- Treat forward verification timeout as pass for large models on CPU.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- ForwardVerifier now records last_timeout_success when eager forward
  passes are skipped due to subprocess timeout.
- GraphNetAgent propagates this flag via last_timeout_success attribute.
- parallel_extract worker reports timeout_success per model.
- PROGRESS line format: success=xx%(timeout_success=xx)%
- Summary and per-GPU stats also include timeout counts/rates.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add ProcessGroupTracker class to track active child process groups
  spawned by SubprocessGraphExtractor, enabling bulk kill via SIGKILL.
- Add orphan watcher daemon thread in parallel_extract worker_fn:
  detects when parent dies (ppid == 1) and kills all tracked child
  process groups, then exits via os._exit(1) to avoid Python-level
  cleanup delays that could block GPU memory release.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
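
The process-group cleanup described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: only the class name `ProcessGroupTracker` comes from the commit message, while `register`, `unregister`, `kill_all`, `start_orphan_watcher`, and `spawn_tracked` are hypothetical names chosen for the example. It assumes a POSIX system and that each child is started in its own session, so its process-group ID equals its PID.

```python
import os
import signal
import subprocess
import sys
import threading
import time


class ProcessGroupTracker:
    """Thread-safe registry of child process-group IDs (illustrative sketch)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pgids = set()

    def register(self, pgid):
        with self._lock:
            self._pgids.add(pgid)

    def unregister(self, pgid):
        with self._lock:
            self._pgids.discard(pgid)

    def kill_all(self):
        """SIGKILL every tracked group; return how many groups were signalled."""
        with self._lock:
            pgids = list(self._pgids)
            self._pgids.clear()
        killed = 0
        for pgid in pgids:
            try:
                os.killpg(pgid, signal.SIGKILL)
                killed += 1
            except ProcessLookupError:
                pass  # the group already exited
        return killed


def start_orphan_watcher(tracker, poll_seconds=1.0):
    """Daemon thread: once the parent is gone (ppid == 1), kill children and exit."""

    def watch():
        while True:
            if os.getppid() == 1:
                tracker.kill_all()
                os._exit(1)  # bypass Python-level cleanup, as the commit describes
            time.sleep(poll_seconds)

    thread = threading.Thread(target=watch, daemon=True)
    thread.start()
    return thread


def spawn_tracked(tracker, argv):
    """Start a child in a new session (pgid == pid) and register its group."""
    proc = subprocess.Popen(argv, start_new_session=True)
    tracker.register(proc.pid)
    return proc
```

A worker would call `start_orphan_watcher(tracker)` once at startup and `spawn_tracked(tracker, [...])` for each extraction subprocess. Note that the `ppid == 1` test presumes the worker's normal parent is not PID 1 itself, which may not hold for a process launched directly as a container entrypoint.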
- Rename exceptions to match subdirectory names:
  AnalysisError -> MetadataAnalysisError
  CodeGenError -> CodeGenerationError
  ExtractionError -> GraphExtractionError
  VerificationError -> SampleVerificationError
- Add error_category to exceptions with default_category class attrs
- Categorize errors at throw sites (404/403, config missing, script
  timeout, output missing, forward verify failed, etc.)
- Introduce GraphExtractionErrorClassifier for type-safe classification
- Smart LLM retry: only retry SCRIPT_EXECUTION_FAILED; skip retry for
  timeouts, model_not_found, model_forbidden, and LLM infra errors

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Move GraphExtractionErrorCategory from error_classifier.py to
  exceptions.py so type definitions live with the data they describe.
- Change default_category and error_category from raw strings to
  GraphExtractionErrorCategory enum values.
- Add missing categories: CONFIG_NOT_FOUND, CONFIG_PARSE_ERROR,
  METADATA_ANALYSIS_FAILED, VERIFICATION_FAILED.
- Update all raise-sites to pass enum members instead of strings.
- Remove redundant inline import in _is_llm_fixable_error.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
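
To make the category and retry rules in the two commits above concrete, here is a small sketch under stated assumptions. The names `GraphExtractionErrorCategory`, `GraphExtractionError`, `SampleVerificationError`, `default_category`, `error_category`, `SCRIPT_EXECUTION_FAILED`, `CONFIG_NOT_FOUND`, and `VERIFICATION_FAILED` appear in the commit messages; the enum string values, the members `SCRIPT_TIMEOUT`, `MODEL_NOT_FOUND`, and `MODEL_FORBIDDEN`, the base class `AgentError`, and the helper `is_llm_fixable` are hypothetical and may differ from the real code.

```python
from enum import Enum


class GraphExtractionErrorCategory(Enum):
    # Member names taken from the PR text; the string values are assumptions.
    SCRIPT_EXECUTION_FAILED = "script_execution_failed"
    CONFIG_NOT_FOUND = "config_not_found"
    VERIFICATION_FAILED = "verification_failed"
    # Hypothetical member names for categories the PR mentions in lowercase.
    SCRIPT_TIMEOUT = "script_timeout"
    MODEL_NOT_FOUND = "model_not_found"
    MODEL_FORBIDDEN = "model_forbidden"


class AgentError(Exception):
    """Hypothetical common base: each subclass declares a default category."""

    default_category = None

    def __init__(self, message, error_category=None):
        super().__init__(message)
        # A raise site may override the class-level default with a finer category.
        self.error_category = error_category or self.default_category


class GraphExtractionError(AgentError):
    default_category = GraphExtractionErrorCategory.SCRIPT_EXECUTION_FAILED


class SampleVerificationError(AgentError):
    default_category = GraphExtractionErrorCategory.VERIFICATION_FAILED


# Only failures that rewriting the script could plausibly fix are retried.
LLM_FIXABLE = frozenset({GraphExtractionErrorCategory.SCRIPT_EXECUTION_FAILED})


def is_llm_fixable(exc):
    """Return True when an LLM script-fix retry is worth attempting."""
    return getattr(exc, "error_category", None) in LLM_FIXABLE
```

Using enum members rather than raw strings means a misspelled category fails at the raise site with an `AttributeError` instead of silently creating a new bucket in the statistics.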
- Rename markdown_report() to report_lines() and return List[str]
  instead of a markdown-formatted string.
- Remove markdown syntax (#, |, ---, etc.) for simpler consumption.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Include ModelFetchError in explicit except clause so it returns
  EXTRACT_FAILED with proper classification instead of ERROR.
- Worker reads agent.error_classifier.get_record(model_id) after
  extract_sample() and forwards error_category + error_message in
  result_dict so the main process can see which stage failed.
- Fallback: if extract_sample itself raises unexpectedly, read
  error_category attribute from the raw exception.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
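
The result forwarding described in this commit might look roughly like the sketch below. It is an illustration only: the status strings `EXTRACT_FAILED` and `ERROR` and the keys `error_category` and `error_message` come from the commit message, whereas the function `run_one`, the `SUCCESS` status, the boolean return of `extract_sample`, and the dictionary shape of the classifier record are assumptions made for the example.

```python
def run_one(model_id, extract_sample, get_record):
    """Run one extraction and build the result dict sent to the main process.

    extract_sample(model_id) is assumed to return True on success;
    get_record(model_id) is assumed to return a dict with "category" and
    "message" keys, or None when nothing was recorded.
    """
    result = {
        "model_id": model_id,
        "status": "SUCCESS",
        "error_category": None,
        "error_message": None,
    }
    try:
        ok = extract_sample(model_id)
    except Exception as exc:
        # Fallback path: read the category straight from the raw exception.
        result["status"] = "ERROR"
        result["error_category"] = getattr(exc, "error_category", None)
        result["error_message"] = str(exc)
        return result
    if not ok:
        # Normal failure path: the agent already classified the error.
        record = get_record(model_id) or {}
        result["status"] = "EXTRACT_FAILED"
        result["error_category"] = record.get("category")
        result["error_message"] = record.get("message")
    return result
```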
Keep source comments and docstrings in English while preserving existing LLM prompt text.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Stop retrying LLM script fixes when a repaired script fails with a category that cannot be addressed by rewriting the script.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Set CPU forward verification timeout back to 600 seconds and keep CLI help text in sync.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use HuggingFace API metadata lookup before download to skip clearly missing or forbidden model repos while preserving download retries for transient failures.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
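
The precheck-then-retry flow from this commit can be sketched as below. To keep the example self-contained, the metadata lookup and the download are passed in as callables rather than calling the HuggingFace client directly; `fetch_model`, `ModelUnavailable`, `lookup_status`, `download`, `max_attempts`, and `backoff_seconds` are all hypothetical names. The 404 and 403 mapping follows the categories mentioned elsewhere in this PR; treating every other lookup outcome as "proceed to download" is an assumption.

```python
import time


class ModelUnavailable(Exception):
    """Raised when the precheck shows the repo is clearly missing or forbidden."""

    def __init__(self, category):
        super().__init__(category)
        self.category = category


def fetch_model(repo_id, lookup_status, download, max_attempts=3, backoff_seconds=0.0):
    """Skip hopeless repos up front, but keep retrying transient download errors.

    lookup_status(repo_id) is assumed to return an HTTP status code (or None if
    the lookup itself failed); download(repo_id) is assumed to raise OSError on
    transient failures.
    """
    status = lookup_status(repo_id)
    if status == 404:
        raise ModelUnavailable("model_not_found")
    if status == 403:
        raise ModelUnavailable("model_forbidden")

    # Any other outcome is inconclusive, so the download is still attempted.
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return download(repo_id)
        except OSError as exc:
            last_error = exc
            time.sleep(backoff_seconds * attempt)
    raise last_error
```

Separating the two steps means a missing repository costs a single metadata request, while a flaky network still gets the full retry budget.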
Ask the LLM script fixer to emit a minimal complete run_model.py without comments, fallback logic, helpers, or unrelated validation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Set the default LLM script-fix timeout back to 360 seconds and update the parameter documentation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Xreki Xreki changed the title from "Increase CPU verify_timeout default from 600s to 1200s." to "Improve the extraction agent reliability, observability, and retry behavior for large-scale HuggingFace model extraction" May 19, 2026
@Xreki Xreki changed the title from "Improve the extraction agent reliability, observability, and retry behavior for large-scale HuggingFace model extraction" to "Improve the extraction agent reliability, observability, and retry behavior" May 19, 2026
Xreki and others added 2 commits May 19, 2026 13:45
Remove the unsupported repo_type argument from the model accessibility precheck so it works with the installed huggingface_hub version.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Simplify generated run_model.py files and structure inputs as a dictionary literal to reduce retry prompt size.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Xreki Xreki merged commit bc0a831 into PaddlePaddle:develop May 19, 2026
3 checks passed
@Xreki Xreki deleted the opt_extract_agent branch May 19, 2026 07:31
2 participants