Feat/model inputs dump by JackTan25 · Pull Request #1118 · alibaba/rtp-llm

JackTan25 · 2026-06-18T08:23:31Z

No description provided.

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR adds an opt-in debug feature to dump model inputs into rotating log files to aid troubleshooting and profiling.

Changes:

Introduces enable_model_inputs_log config/CLI/env flag and propagates it through executors and gatherers.
Extends GptModelInputs with extra tensors for logging snapshots and wires a new ModelInputsLogger into PyWrappedModel::forward.
Implements ModelInputsLogger with file rotation, periodic flush, and optional metrics reporting.

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
rtp_llm/server/server_args/profile_debug_logging_group_args.py	Adds CLI/env switch to enable model input logging.
rtp_llm/cpp/config/ConfigModules.h	Adds `enable_model_inputs_log` to profiling debug config.
rtp_llm/cpp/config/ConfigModules.cc	Prints the new config field in `to_string()`.
rtp_llm/cpp/pybind/ConfigInit.cc	Exposes the new config field to Python + extends pickle state.
rtp_llm/models_py/bindings/core/OpData.h	Adds host snapshot tensors to `GptModelInputs` for logging.
rtp_llm/cpp/normal_engine/NormalModelInputGatherer.h	Adds gatherer flag to control host snapshots for logging.
rtp_llm/cpp/normal_engine/NormalModelInputGatherer.cc	Populates logging snapshot tensors when enabled.
rtp_llm/cpp/normal_engine/NormalBatchStreamProcessor.cc	Passes the new flag into the input gatherer config.
rtp_llm/cpp/normal_engine/NormalExecutor.h	Stores a shared `ModelInputsLogger` in the executor.
rtp_llm/cpp/normal_engine/NormalExecutor.cc	Conditionally constructs/injects `ModelInputsLogger` into model.
rtp_llm/cpp/normal_engine/speculative/MtpExecutor.h	Stores a shared `ModelInputsLogger` for speculative executor.
rtp_llm/cpp/normal_engine/speculative/MtpExecutor.cc	Conditionally constructs/injects `ModelInputsLogger` into models.
rtp_llm/cpp/models/PyWrappedModel.h	Adds logger dependency to constructor and stores it.
rtp_llm/cpp/models/PyWrappedModel.cc	Emits model inputs logs at start of `forward()`.
rtp_llm/cpp/models/ModelInputsLogger.h	Adds new logger class interface.
rtp_llm/cpp/models/ModelInputsLogger.cc	Implements JSONL logging + rotation + metrics.
rtp_llm/cpp/models/BUILD	Adds build deps needed by `ModelInputsLogger`.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+        output_.open(file_path_, std::ios::out | std::ios::trunc);
+        bytes_ = 0;
+    }
+
+    std::mutex            mutex_;
+    std::filesystem::path file_path_;
+    std::ofstream         output_;
+    int                   backup_count_  = 0;
+    size_t                bytes_         = 0;
+    size_t                pending_lines_ = 0;
+    int64_t               last_flush_us_ = 0;
+    bool                  valid_         = false;


+std::string jsonEscape(const std::string& input) {
+    std::ostringstream os;
+    for (unsigned char c : input) {
+        switch (c) {
+            case '\\':
+                os << "\\\\";
+                break;
+            case '"':
+                os << "\\\"";
+                break;
+            case '\n':
+                os << "\\n";
+                break;
+            case '\r':
+                os << "\\r";
+                break;
+            case '\t':
+                os << "\\t";
+                break;
+            default:
+                os << static_cast<char>(c);
+                break;
+        }
+    }
+    return os.str();
+}


+std::string combineStringsForLog(const std::vector<std::string>& vec) {
+    std::string result = "\" ";
+    for (const auto& s : vec) {
+        result += s + ", ";
+    }
+    result += "\"";
+    return result;
+}


+    profile_debug_logging_group.add_argument(
+        "--enable_model_inputs_log",
+        env_name="ENABLE_MODEL_INPUTS_LOG",
+        bind_to=(profiling_debug_config, "enable_model_inputs_log"),
+        type=str2bool,
+        default=False,
+        help="控制是否打印模型输入日志。可选值: True (启用), False (禁用)。默认为 False",
+    )


+    torch::Tensor         combo_tokens_host_for_log;
+    torch::Tensor         input_lengths_host_for_log;
+    torch::Tensor         sequence_lengths_host_for_log;
+    torch::Tensor         prefix_lengths_host_for_log;


LLLLKKKK · 2026-06-18T08:50:45Z

AI Code Review - PR #1118

Status: LGTM

Summary: P0/0 · P1/0 · P2/3 · P3/0

lgtm ready to ci

Non-blocking Suggestions

P2

MTP 路径会记录过期的模型输入快照 @ rtp_llm/cpp/models/ModelInputsLogger.cc:132
- 建议：在每次修改 GptModelInputs 后同步更新或清空 *_host_for_log，或仅在快照与当前 tensor 同源时使用快照。
新增模型输入日志链路缺少测试覆盖 @ rtp_llm/cpp/models/ModelInputsLogger.cc:248
- 建议：补充 logger 单测或 executor/gatherer 测试，覆盖普通 decode/prefill、MTP 修改输入后的日志内容和默认关闭路径。
decode 日志的顶层 id 恒为 -1 @ rtp_llm/cpp/models/ModelInputsLogger.cc:100
- 建议：为 decode batch 填充 request_id，或在 request_id 为空时退化使用 trace_ids/stream id 生成可关联的日志 id。

Checklist Violations (6 fail / 78 total)

General Principles Checklist

[6.1] Architecture — 状态不变量：创建/更新/失败/重试/回滚路径有效 → issue MTP 路径会记录过期的模型输入快照
_*host_for_log 快照在 gather 后设置，但 MTP 后续会改写 GptModelInputs 再 forward，日志状态会与实际输入不一致。
[6.1] Architecture — 可观测性：日志/指标/超时可操作、非噪声 → issue decode 日志的顶层 id 恒为 -1
ModelInputsLogger.cc 的 firstRequestId 只读 request_id，decode-only batch 为空时顶层 id 为 -1，削弱请求关联。
[6.1] Tests — 新逻辑有聚焦单测 + 相关集成/smoke 测试 → issue 新增模型输入日志链路缺少测试覆盖
diff_paths 没有测试文件，新增日志格式、开关传播和 MTP 快照一致性未被自动覆盖。
[6.1] Tests — 边界 case 覆盖（空、单元素、最大值） → issue 新增模型输入日志链路缺少测试覆盖
日志轮转、空 tensor、CUDA tensor、长输入截断和 MTP 多次 forward 等边界未见测试覆盖。
[6.1] Tests — 分布式/跨平台变更有对应覆盖 → checklist-only
新增日志会在 TP rank 上执行，但功能默认关闭；建议后续覆盖多 rank dump 一致性，暂不单独升级为 issue。

RTP-LLM Checklist

[H] 测试与 CI — 测试覆盖充分：大重构等价覆盖，新功能端到端测试 → issue 新增模型输入日志链路缺少测试覆盖
新增模型输入日志功能无对应测试文件。

Strengths

新增开关默认关闭，且 ProfilingDebugLoggingConfig 的 pybind pickle 对旧 12 元组保留兼容。
日志 writer 有大小轮转、批量 flush 和耗时 metric，便于控制排查功能的运行成本。

LLLLKKKK · 2026-06-22T02:56:25Z

+    return pos == std::string::npos ? tensor_with_data : tensor_with_data.substr(pos + 2);
+}
+
+std::string tensorLogString(const torch::Tensor& tensor, const torch::Tensor& host_snapshot = {}) {


搞这些干啥？直接 torch dump 就行了

LLLLKKKK · 2026-06-22T03:02:10Z

@@ -0,0 +1,271 @@
+#include "rtp_llm/cpp/models/ModelInputsLogger.h"


replay 也要做了

Add model inputs logging

7affcb5

Copilot AI review requested due to automatic review settings June 18, 2026 08:23

JackTan25 requested a review from LLLLKKKK as a code owner June 18, 2026 08:23

Copilot AI reviewed Jun 18, 2026

View reviewed changes

refactor: centralize model inputs logging

c8deafc

JackTan25 force-pushed the feat/model_inputs_dump branch from 6eab357 to c8deafc Compare June 18, 2026 08:39

Copilot started reviewing on behalf of JackTan25 June 18, 2026 22:21 View session

LLLLKKKK requested changes Jun 22, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Feat/model inputs dump#1118

Feat/model inputs dump#1118
JackTan25 wants to merge 2 commits into
mainfrom
feat/model_inputs_dump

JackTan25 commented Jun 18, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

LLLLKKKK commented Jun 18, 2026

Uh oh!

LLLLKKKK Jun 22, 2026

Uh oh!

LLLLKKKK Jun 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		@@ -0,0 +1,271 @@
		#include "rtp_llm/cpp/models/ModelInputsLogger.h"

Uh oh!

Conversation

JackTan25 commented Jun 18, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

LLLLKKKK commented Jun 18, 2026

AI Code Review - PR #1118

Non-blocking Suggestions

P2

Checklist Violations (6 fail / 78 total)

Strengths

Uh oh!

LLLLKKKK Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

LLLLKKKK Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants