Skip to content

Feat/model inputs dump#1118

Open
JackTan25 wants to merge 2 commits into
mainfrom
feat/model_inputs_dump
Open

Feat/model inputs dump#1118
JackTan25 wants to merge 2 commits into
mainfrom
feat/model_inputs_dump

Conversation

@JackTan25

Copy link
Copy Markdown
Collaborator

No description provided.

Copilot AI review requested due to automatic review settings June 18, 2026 08:23
@JackTan25 JackTan25 requested a review from LLLLKKKK as a code owner June 18, 2026 08:23

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR adds an opt-in debug feature to dump model inputs into rotating log files to aid troubleshooting and profiling.

Changes:

  • Introduces enable_model_inputs_log config/CLI/env flag and propagates it through executors and gatherers.
  • Extends GptModelInputs with extra tensors for logging snapshots and wires a new ModelInputsLogger into PyWrappedModel::forward.
  • Implements ModelInputsLogger with file rotation, periodic flush, and optional metrics reporting.

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
rtp_llm/server/server_args/profile_debug_logging_group_args.py Adds CLI/env switch to enable model input logging.
rtp_llm/cpp/config/ConfigModules.h Adds enable_model_inputs_log to profiling debug config.
rtp_llm/cpp/config/ConfigModules.cc Prints the new config field in to_string().
rtp_llm/cpp/pybind/ConfigInit.cc Exposes the new config field to Python + extends pickle state.
rtp_llm/models_py/bindings/core/OpData.h Adds host snapshot tensors to GptModelInputs for logging.
rtp_llm/cpp/normal_engine/NormalModelInputGatherer.h Adds gatherer flag to control host snapshots for logging.
rtp_llm/cpp/normal_engine/NormalModelInputGatherer.cc Populates logging snapshot tensors when enabled.
rtp_llm/cpp/normal_engine/NormalBatchStreamProcessor.cc Passes the new flag into the input gatherer config.
rtp_llm/cpp/normal_engine/NormalExecutor.h Stores a shared ModelInputsLogger in the executor.
rtp_llm/cpp/normal_engine/NormalExecutor.cc Conditionally constructs/injects ModelInputsLogger into model.
rtp_llm/cpp/normal_engine/speculative/MtpExecutor.h Stores a shared ModelInputsLogger for speculative executor.
rtp_llm/cpp/normal_engine/speculative/MtpExecutor.cc Conditionally constructs/injects ModelInputsLogger into models.
rtp_llm/cpp/models/PyWrappedModel.h Adds logger dependency to constructor and stores it.
rtp_llm/cpp/models/PyWrappedModel.cc Emits model inputs logs at start of forward().
rtp_llm/cpp/models/ModelInputsLogger.h Adds new logger class interface.
rtp_llm/cpp/models/ModelInputsLogger.cc Implements JSONL logging + rotation + metrics.
rtp_llm/cpp/models/BUILD Adds build deps needed by ModelInputsLogger.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +229 to +240
output_.open(file_path_, std::ios::out | std::ios::trunc);
bytes_ = 0;
}

std::mutex mutex_;
std::filesystem::path file_path_;
std::ofstream output_;
int backup_count_ = 0;
size_t bytes_ = 0;
size_t pending_lines_ = 0;
int64_t last_flush_us_ = 0;
bool valid_ = false;
Comment on lines +31 to +56
std::string jsonEscape(const std::string& input) {
std::ostringstream os;
for (unsigned char c : input) {
switch (c) {
case '\\':
os << "\\\\";
break;
case '"':
os << "\\\"";
break;
case '\n':
os << "\\n";
break;
case '\r':
os << "\\r";
break;
case '\t':
os << "\\t";
break;
default:
os << static_cast<char>(c);
break;
}
}
return os.str();
}
Comment on lines +58 to +65
std::string combineStringsForLog(const std::vector<std::string>& vec) {
std::string result = "\" ";
for (const auto& s : vec) {
result += s + ", ";
}
result += "\"";
return result;
}
Comment on lines +99 to +106
profile_debug_logging_group.add_argument(
"--enable_model_inputs_log",
env_name="ENABLE_MODEL_INPUTS_LOG",
bind_to=(profiling_debug_config, "enable_model_inputs_log"),
type=str2bool,
default=False,
help="控制是否打印模型输入日志。可选值: True (启用), False (禁用)。默认为 False",
)
Comment on lines +42 to +45
torch::Tensor combo_tokens_host_for_log;
torch::Tensor input_lengths_host_for_log;
torch::Tensor sequence_lengths_host_for_log;
torch::Tensor prefix_lengths_host_for_log;
@JackTan25 JackTan25 force-pushed the feat/model_inputs_dump branch from 6eab357 to c8deafc Compare June 18, 2026 08:39
@LLLLKKKK

Copy link
Copy Markdown
Collaborator

AI Code Review - PR #1118

Status: LGTM

Summary: P0/0 · P1/0 · P2/3 · P3/0

lgtm ready to ci

Non-blocking Suggestions

P2

  • MTP 路径会记录过期的模型输入快照 @ rtp_llm/cpp/models/ModelInputsLogger.cc:132
    • 建议:在每次修改 GptModelInputs 后同步更新或清空 *_host_for_log,或仅在快照与当前 tensor 同源时使用快照。
  • 新增模型输入日志链路缺少测试覆盖 @ rtp_llm/cpp/models/ModelInputsLogger.cc:248
    • 建议:补充 logger 单测或 executor/gatherer 测试,覆盖普通 decode/prefill、MTP 修改输入后的日志内容和默认关闭路径。
  • decode 日志的顶层 id 恒为 -1 @ rtp_llm/cpp/models/ModelInputsLogger.cc:100
    • 建议:为 decode batch 填充 request_id,或在 request_id 为空时退化使用 trace_ids/stream id 生成可关联的日志 id。

Checklist Violations (6 fail / 78 total)

General Principles Checklist

  • [6.1] Architecture — 状态不变量:创建/更新/失败/重试/回滚路径有效 → issue MTP 路径会记录过期的模型输入快照
    _*host_for_log 快照在 gather 后设置,但 MTP 后续会改写 GptModelInputs 再 forward,日志状态会与实际输入不一致。
  • [6.1] Architecture — 可观测性:日志/指标/超时可操作、非噪声 → issue decode 日志的顶层 id 恒为 -1
    ModelInputsLogger.cc 的 firstRequestId 只读 request_id,decode-only batch 为空时顶层 id 为 -1,削弱请求关联。
  • [6.1] Tests — 新逻辑有聚焦单测 + 相关集成/smoke 测试 → issue 新增模型输入日志链路缺少测试覆盖
    diff_paths 没有测试文件,新增日志格式、开关传播和 MTP 快照一致性未被自动覆盖。
  • [6.1] Tests — 边界 case 覆盖(空、单元素、最大值) → issue 新增模型输入日志链路缺少测试覆盖
    日志轮转、空 tensor、CUDA tensor、长输入截断和 MTP 多次 forward 等边界未见测试覆盖。
  • [6.1] Tests — 分布式/跨平台变更有对应覆盖 → checklist-only
    新增日志会在 TP rank 上执行,但功能默认关闭;建议后续覆盖多 rank dump 一致性,暂不单独升级为 issue。

RTP-LLM Checklist

  • [H] 测试与 CI — 测试覆盖充分:大重构等价覆盖,新功能端到端测试 → issue 新增模型输入日志链路缺少测试覆盖
    新增模型输入日志功能无对应测试文件。

Strengths

  • 新增开关默认关闭,且 ProfilingDebugLoggingConfig 的 pybind pickle 对旧 12 元组保留兼容。
  • 日志 writer 有大小轮转、批量 flush 和耗时 metric,便于控制排查功能的运行成本。

return pos == std::string::npos ? tensor_with_data : tensor_with_data.substr(pos + 2);
}

std::string tensorLogString(const torch::Tensor& tensor, const torch::Tensor& host_snapshot = {}) {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

搞这些干啥?直接 torch dump 就行了

@@ -0,0 +1,271 @@
#include "rtp_llm/cpp/models/ModelInputsLogger.h"

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

replay 也要做了

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants