fix llm_sim run_mode override and latency.csv e2e formula by huangxinV587 · Pull Request #202 · bytedance/xpu-perf

huangxinV587 · 2026-06-11T09:50:15Z

Summary

Two verified correctness bugs in projects/xpu_oj/llm_sim/engine.py, each a one-line fix.

1. `--run_mode decode` was silently ignored (engine.py:787)

parse_model() re-derived run_mode via kwargs.get("run_mode", "prefill"). However, kwargs can never contain run_mode: it is a named constructor parameter, so even a keyword call binds it to that parameter rather than to **kwargs, and the only real kwargs (disable_print_result, bench_stream_progress) are popped before parse_model(**kwargs) is called. As a result, the engine always ran prefill:

- op workloads got attn_mode="prefill" regardless of the CLI flag
- the prefill-only num_layers -= num_mirror_layers cut was applied to decode runs

Fix: fall back to the ctor-set self.run_mode (set from the CLI arg at engine.py:41) while keeping the explicit-kwargs override path.
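
To illustrate the binding behavior described above, here is a minimal sketch. The Engine class, its method names, and the attribute names are hypothetical stand-ins, not the actual code in engine.py; only the kwargs.get pattern and the two popped keys come from this PR.

```python
class Engine:
    """Hypothetical stand-in for the llm_sim engine (illustrative only)."""

    def __init__(self, run_mode="prefill", **kwargs):
        # run_mode is a named parameter, so Engine(run_mode="decode")
        # binds here and never lands in **kwargs.
        self.run_mode = run_mode
        # The only real kwargs are popped before parse_model is called.
        kwargs.pop("disable_print_result", None)
        kwargs.pop("bench_stream_progress", None)
        self.leftover_kwargs = dict(kwargs)
        self.buggy_mode = self.parse_model_buggy(**kwargs)
        self.fixed_mode = self.parse_model_fixed(**kwargs)

    def parse_model_buggy(self, **kwargs):
        # Old behavior: the default always wins because the key is absent.
        return kwargs.get("run_mode", "prefill")

    def parse_model_fixed(self, **kwargs):
        # New behavior: fall back to the ctor-set value, while an explicit
        # run_mode in kwargs would still override it.
        return kwargs.get("run_mode", self.run_mode)


engine = Engine(run_mode="decode", disable_print_result=True)
print(engine.leftover_kwargs)                 # {}
print(engine.buggy_mode, engine.fixed_mode)   # prefill decode
```

The fallback keeps the override path intact: calling parse_model_fixed(run_mode="prefill") directly would still return "prefill".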

2. `latency.csv` `e2e_latency` omitted the layer count (engine.py:382)

The code computed e2e_latency = stage_latency * pp_size, where stage_latency is the one-layer DAG makespan. The reported e2e was therefore roughly num_layers times too small and contradicted perf_info["e2e_latency"] (= total_latency * num_layers, engine.py:869).

Fix: e2e_latency = stage_latency * num_layers * pp_size (num_layers is the per-PP-stage layer count, same convention as line 869).
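
As a worked example of the two formulas, the sketch below plugs in the numbers used in the verification run (100 us per layer, 10 layers per stage, pp_size 2). The function names are hypothetical and exist only for this illustration.

```python
def e2e_latency_buggy(stage_latency, pp_size):
    # Old formula: one-layer latency scaled by pipeline stages only.
    return stage_latency * pp_size


def e2e_latency_fixed(stage_latency, num_layers, pp_size):
    # New formula: num_layers is the per-PP-stage layer count.
    return stage_latency * num_layers * pp_size


stage_latency_us = 100.0
num_layers = 10
pp_size = 2

buggy = e2e_latency_buggy(stage_latency_us, pp_size)
fixed = e2e_latency_fixed(stage_latency_us, num_layers, pp_size)
print(buggy, fixed, fixed / buggy)  # 200.0 2000.0 10.0
```

The ratio between the two values equals num_layers, which matches the size of the under-report described above.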

Verification

Offline targeted harness (bypasses bench server via __new__, monkeypatched report dir), plus py_compile:

- decode: run_mode=decode, num_layers=64 (mirror layers no longer wrongly subtracted)
- prefill: num_layers=62 (mirror subtraction logic unchanged)
- latency.csv: stage=100.0, e2e=2000.0 (= 100 us x 10 layers x pp_size 2)
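
For readers unfamiliar with the __new__ bypass, the following sketch shows the general shape of such a harness. It is not the harness used for this PR: the Engine class, the total_layers and num_mirror_layers attributes, and the values 64 and 2 are assumptions chosen to reproduce the layer counts listed above.

```python
class Engine:
    """Hypothetical engine whose constructor needs a bench server."""

    def __init__(self, run_mode="prefill"):
        raise RuntimeError("would connect to the bench server")

    def parse_model(self, **kwargs):
        # Fixed lookup: fall back to the ctor-set run_mode.
        run_mode = kwargs.get("run_mode", self.run_mode)
        num_layers = self.total_layers
        if run_mode == "prefill":
            # Prefill-only cut; must not apply to decode runs.
            num_layers -= self.num_mirror_layers
        return run_mode, num_layers


def make_engine(run_mode):
    # object creation without __init__, so no server is contacted
    engine = Engine.__new__(Engine)
    engine.run_mode = run_mode
    engine.total_layers = 64       # assumed value for illustration
    engine.num_mirror_layers = 2   # assumed value for illustration
    return engine


decode_result = make_engine("decode").parse_model()
prefill_result = make_engine("prefill").parse_model()
print(decode_result)   # ('decode', 64)
print(prefill_result)  # ('prefill', 62)
```

Because __new__ skips the constructor, every attribute the method under test reads has to be set by hand, which keeps the harness narrowly targeted.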

- parse_model() re-derived run_mode via kwargs.get("run_mode", "prefill"), but kwargs can never contain run_mode (it is a positional ctor arg and the only real kwargs are popped before the call), so --run_mode decode silently benched prefill workloads and applied the prefill-only mirror-layer cut. Fall back to the ctor-set self.run_mode instead.
- latency.csv wrote e2e_latency = one-layer latency * pp_size, omitting the per-stage layer count. Now stage_latency * num_layers * pp_size, consistent with perf_info["e2e_latency"] (total_latency * num_layers).

CLAassistant · 2026-06-11T09:53:13Z

All committers have signed the CLA.

huangxinV587 mentioned this pull request Jun 11, 2026

fix llm_sim run_mode override and latency.csv e2e formula huangxinV587/xpu-perf#1

Closed

huangxinV587 wants to merge 1 commit into bytedance:main from huangxinV587:fix/llm-sim-run-mode-e2e
