fix llm_sim run_mode override and latency.csv e2e formula #202

Open
huangxinV587 wants to merge 1 commit into bytedance:main from huangxinV587:fix/llm-sim-run-mode-e2e

@huangxinV587

Summary

Two verified correctness bugs in projects/xpu_oj/llm_sim/engine.py, each a one-line fix.

1. --run_mode decode was silently ignored (engine.py:787)

parse_model() re-derived run_mode via kwargs.get("run_mode", "prefill"). But kwargs can never contain run_mode: it is a positional ctor arg (a keyword call binds to the parameter, not **kwargs), and the only real kwargs (disable_print_result, bench_stream_progress) are popped before parse_model(**kwargs) is called. So the engine always ran prefill:

  • op workloads got attn_mode="prefill" regardless of CLI flag
  • the prefill-only num_layers -= num_mirror_layers cut was applied to decode runs

Fix: fall back to the ctor-set self.run_mode (set from the CLI arg at engine.py:41) while keeping the explicit-kwargs override path.
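The binding behaviour behind this bug can be reproduced in isolation. The sketch below is a toy stand-in, not the real engine: the class name `ToyEngine`, the `model` parameter, and the two `parse_model_*` helpers are hypothetical, and only the `kwargs.get(...)` expressions and the two popped kwarg names come from the description above.

```python
class ToyEngine:
    """Hypothetical stand-in for the engine; names are illustrative only."""

    def __init__(self, model, run_mode="prefill", **kwargs):
        # run_mode is a named parameter, so a keyword call binds here
        # and never lands in **kwargs.
        self.run_mode = run_mode
        # The only real kwargs are popped before parse_model is called.
        kwargs.pop("disable_print_result", None)
        kwargs.pop("bench_stream_progress", None)
        self.seen_kwargs = dict(kwargs)
        self.buggy_mode = self.parse_model_buggy(**kwargs)
        self.fixed_mode = self.parse_model_fixed(**kwargs)

    def parse_model_buggy(self, **kwargs):
        # Old behaviour: always falls through to the "prefill" default.
        return kwargs.get("run_mode", "prefill")

    def parse_model_fixed(self, **kwargs):
        # New behaviour: explicit kwarg wins, else the ctor-set value.
        return kwargs.get("run_mode", self.run_mode)


engine = ToyEngine("some_model", run_mode="decode", disable_print_result=True)
print(engine.seen_kwargs)  # {}
print(engine.buggy_mode)   # prefill
print(engine.fixed_mode)   # decode
```

Keeping `kwargs.get("run_mode", ...)` rather than reading `self.run_mode` directly preserves the override path for any caller that passes `run_mode` to `parse_model` explicitly.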

2. latency.csv e2e_latency omitted the layer count (engine.py:382)

The old code computed e2e_latency = stage_latency * pp_size, where stage_latency is the one-layer DAG makespan. The reported e2e was therefore too small by a factor of about num_layers, and it contradicted perf_info["e2e_latency"] (= total_latency * num_layers, engine.py:869).

Fix: e2e_latency = stage_latency * num_layers * pp_size (num_layers is the per-PP-stage layer count, same convention as line 869).
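As a quick arithmetic check of the two formulas, here is a minimal sketch. The function names are hypothetical and the inputs (100 us per layer, 10 layers per stage, pp_size 2) are the values used in the verification run; nothing here is copied from engine.py.

```python
def e2e_latency_old(stage_latency, pp_size):
    # Old formula: one-layer makespan times pipeline stages only.
    return stage_latency * pp_size


def e2e_latency_fixed(stage_latency, num_layers, pp_size):
    # Fixed formula: num_layers is the per-PP-stage layer count.
    return stage_latency * num_layers * pp_size


stage_latency = 100.0  # microseconds for a single layer
num_layers = 10        # layers per pipeline stage
pp_size = 2            # pipeline-parallel stages

old = e2e_latency_old(stage_latency, pp_size)
new = e2e_latency_fixed(stage_latency, num_layers, pp_size)
print(old)        # 200.0
print(new)        # 2000.0
print(new / old)  # 10.0, i.e. the missing num_layers factor
```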

Verification

Offline targeted harness (bypasses bench server via __new__, monkeypatched report dir), plus py_compile:

  • decode: run_mode=decode, num_layers=64 (mirror layers no longer wrongly subtracted)
  • prefill: num_layers=62 (mirror subtraction logic unchanged)
  • latency.csv: stage=100.0, e2e=2000.0 (= 100 us x 10 layers x pp_size 2)
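The harness itself is not included in this PR. The following is a rough sketch of the same idea, assuming a toy class: `ToyEngine`, `make_engine`, `write_latency_csv`, the `report_dir` attribute, and the mirror-layer count of 2 (inferred from 64 vs 62) are all hypothetical and do not reflect the real engine's API.

```python
import csv
import tempfile
from pathlib import Path


class ToyEngine:
    NUM_MIRROR_LAYERS = 2  # assumption, inferred from 64 vs 62 above

    def __init__(self, run_mode):
        # Stand-in for a constructor that would contact the bench server.
        raise RuntimeError("would contact the bench server")

    def parse_model(self, total_layers, **kwargs):
        run_mode = kwargs.get("run_mode", self.run_mode)
        self.num_layers = total_layers
        if run_mode == "prefill":
            # Prefill-only cut; must not apply to decode runs.
            self.num_layers -= self.NUM_MIRROR_LAYERS

    def write_latency_csv(self, stage_latency, pp_size):
        path = Path(self.report_dir) / "latency.csv"
        with path.open("w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["stage_latency", "e2e_latency"])
            writer.writerow(
                [stage_latency, stage_latency * self.num_layers * pp_size]
            )
        return path


def make_engine(run_mode, report_dir):
    # __new__ allocates the object without running __init__,
    # so no server connection is attempted.
    engine = ToyEngine.__new__(ToyEngine)
    engine.run_mode = run_mode
    engine.report_dir = report_dir  # redirected report directory
    return engine


with tempfile.TemporaryDirectory() as tmp:
    decode = make_engine("decode", tmp)
    decode.parse_model(64)
    decode_layers = decode.num_layers

    prefill = make_engine("prefill", tmp)
    prefill.parse_model(64)
    prefill_layers = prefill.num_layers

    decode.num_layers = 10  # small value for the CSV check
    csv_path = decode.write_latency_csv(100.0, 2)
    with csv_path.open(newline="") as f:
        rows = list(csv.reader(f))

print(decode_layers, prefill_layers)  # 64 62
print(rows[1])                        # ['100.0', '2000.0']
```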

- parse_model() re-derived run_mode via kwargs.get("run_mode", "prefill"),
  but kwargs can never contain run_mode (it is a positional ctor arg and the
  only real kwargs are popped before the call), so --run_mode decode silently
  benched prefill workloads and applied the prefill-only mirror-layer cut.
  Fall back to the ctor-set self.run_mode instead.
- latency.csv wrote e2e_latency = one-layer latency * pp_size, omitting the
  per-stage layer count. Now stage_latency * num_layers * pp_size, consistent
  with perf_info["e2e_latency"] (total_latency * num_layers).
@CLAassistant

CLAassistant commented Jun 11, 2026

CLA assistant check
All committers have signed the CLA.
