Skip to content

refactor(profiler): add RAII step scope for service-wide timeline#1133

Open
psmarter wants to merge 1 commit into
mainfrom
feat/profiler-step-scope-v2
Open

refactor(profiler): add RAII step scope for service-wide timeline#1133
psmarter wants to merge 1 commit into
mainfrom
feat/profiler-step-scope-v2

Conversation

@psmarter

Copy link
Copy Markdown
Collaborator

Motivation

Service-wide timeline profiling should work consistently across engine types, including embedding workloads such as BERT. This PR also replaces the manual profiler tick() flow with an RAII step scope so the profiling step boundary is explicit and exception-safe.

Changes

  • Add StepWindowProfiler::StepScope to bracket one engine step.
    • Calls beginStep() on construction.
    • Calls endStep() on destruction.
    • Catches and logs endStep() exceptions without propagating them into the engine
      path.
  • Add StepWindowProfiler::configureFromConfig() for service-wide timeline setup from
    ProfilingDebugLoggingConfig.
  • Wire service-wide timeline profiling into:
    • NormalEngine
    • EmbeddingEngine
  • Add service-wide timeline config fields:
    • --gen_timeline_start_step / GEN_TIMELINE_START_STEP
    • --gen_timeline_num_steps / GEN_TIMELINE_NUM_STEPS
    • --gen_timeline_trace_name / GEN_TIMELINE_TRACE_NAME
  • Keep pickle backward compatibility for ProfilingDebugLoggingConfig.
    • Old 12-field tuple is still accepted.
    • New 15-field tuple includes the timeline window fields.
  • Update profiling docs to describe service-wide timeline as a step-window trace.

Validation

  • git diff --check origin/main..HEAD
  • python3 -m py_compile rtp_llm/server/server_args/profile_debug_logging_group_args.py
  • bazelisk build //rtp_llm/cpp/engine_base:profiler //rtp_llm/cpp/normal_engine:normal_engine//rtp_llm/cpp/embedding_engine:embedding_engine --config=rocm

Build result:

INFO: Build completed successfully

@psmarter psmarter requested a review from LLLLKKKK as a code owner June 23, 2026 09:40
Copilot AI review requested due to automatic review settings June 23, 2026 09:40

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

@LLLLKKKK

Copy link
Copy Markdown
Collaborator

AI Code Review - PR #1133

Status: LGTM

Summary: P0/0 · P1/0 · P2/1 · P3/0

lgtm ready to ci

Non-blocking Suggestions

P2

  • 新增 StepScope RAII 模式和配置字段缺少测试覆盖 @ rtp_llm/cpp/engine_base/TorchProfiler.h:81
    • 建议:建议至少添加:(1) StepScope RAII 构造/析构正确调用 beginStep/endStep 的验证;(2) step 计数达到 num_steps 后 enabled_ 自动清除的测试;(3) configureFromConfig 在 gen_timeline_sync=false 时 no-op 的单元测试;(4) endStep 中 disable 路径调用 stopProfiler;(5) reconfigure 重置 waited_steps/profiled_steps;(6) pickle round-trip 测试覆盖 size=12 和 size=15 两种格式。可用 mock TorchProfile 避免 Kineto 依赖。

Checklist Violations (1 fail / 72 total)

General Principles Checklist

  • [6.1] Tests — 新逻辑有聚焦单测 + 相关集成/smoke 测试 → issue 新增 StepScope RAII 模式和配置字段缺少测试覆盖
    StepScope RAII guard、configureFromConfig()、beginStep/endStep 状态机均无测试。grep 整个 test/ 目录无 StepWindowProfiler 引用。新增三个 config 字段的传播链也缺少覆盖。

Strengths

  • RAII StepScope 从根本上消除了手动 tick() 的不对称调用风险——旧代码 EmbeddingEngine 只调一次 tick()、NormalEngine 调两次,语义不一致;新 RAII 统一了两个引擎的 profiler 生命周期
  • StepScope 析构函数正确捕获 std::exception 和 catch-all 两级异常并记录 ERROR 日志,符合 C++ 析构不应抛异常的最佳实践
  • pickle 反序列化通过 tuple size 分支(12/15)实现向后兼容,旧格式使用 C++ struct 默认值,平滑升级
  • configureFromConfig() 统一了两个引擎的配置入口,消除了 gen_timeline_sync 检查的重复代码

},
[](py::tuple t) {
if (t.size() != 12)
if (t.size() != 12 && t.size() != 15)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个是不需要的

@LLLLKKKK LLLLKKKK enabled auto-merge (rebase) June 26, 2026 08:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants