fix llm_sim run_mode override and latency.csv e2e formula #202
Open
huangxinV587 wants to merge 1 commit into
- parse_model() re-derived run_mode via kwargs.get("run_mode", "prefill"),
but kwargs can never contain run_mode (it is a positional ctor arg and the
only real kwargs are popped before the call), so --run_mode decode silently
benched prefill workloads and applied the prefill-only mirror-layer cut.
Fall back to the ctor-set self.run_mode instead.
- latency.csv wrote e2e_latency = one-layer latency * pp_size, omitting the
per-stage layer count. Now stage_latency * num_layers * pp_size, consistent
with perf_info["e2e_latency"] (total_latency * num_layers).
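The binding behavior behind the first bullet can be reproduced in isolation. The sketch below is a minimal stand-in, not the real engine: the class `Engine`, the methods `parse_model_buggy` and `parse_model_fixed`, and the attribute `seen_kwargs` are hypothetical names for illustration, and the fixed lookup is only one plausible reading of "fall back to the ctor-set self.run_mode".

```python
class Engine:
    """Hypothetical stand-in for the llm_sim engine (illustration only)."""

    def __init__(self, run_mode="prefill", **kwargs):
        # A keyword call like Engine(run_mode="decode") binds to the named
        # parameter, so "run_mode" never lands in **kwargs.
        self.run_mode = run_mode
        # The only real kwargs are popped before parse_model(**kwargs).
        kwargs.pop("disable_print_result", None)
        kwargs.pop("bench_stream_progress", None)
        self.seen_kwargs = dict(kwargs)
        self.buggy_mode = self.parse_model_buggy(**kwargs)
        self.fixed_mode = self.parse_model_fixed(**kwargs)

    def parse_model_buggy(self, **kwargs):
        # Old behavior: the default always wins because the key is never present.
        return kwargs.get("run_mode", "prefill")

    def parse_model_fixed(self, **kwargs):
        # Assumed shape of the fix: explicit kwarg still overrides,
        # otherwise fall back to the value set in the constructor.
        return kwargs.get("run_mode", self.run_mode)


engine = Engine(run_mode="decode", disable_print_result=True)
print(engine.seen_kwargs, engine.buggy_mode, engine.fixed_mode)
```

With the old lookup the requested mode is lost; with the fallback it survives, and an explicit `run_mode` passed directly to `parse_model_fixed` would still take precedence.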
Summary
Two verified correctness bugs in projects/xpu_oj/llm_sim/engine.py, each a one-line fix.

1. --run_mode decode was silently ignored (engine.py:787)

parse_model() re-derived run_mode via kwargs.get("run_mode", "prefill"). But kwargs can never contain run_mode: it is a positional ctor arg (a keyword call binds to the parameter, not **kwargs), and the only real kwargs (disable_print_result, bench_stream_progress) are popped before parse_model(**kwargs) is called. So the engine always ran prefill:

- attn_mode="prefill" regardless of CLI flag
- num_layers -= num_mirror_layers cut was applied to decode runs

Fix: fall back to the ctor-set self.run_mode (set from the CLI arg at engine.py:41) while keeping the explicit-kwargs override path.

2. latency.csv e2e_latency omitted the layer count (engine.py:382)

e2e_latency = stage_latency * pp_size where stage_latency is the one-layer DAG makespan, so the reported e2e was ~num_layers x too small and contradicted perf_info["e2e_latency"] (= total_latency * num_layers, engine.py:869).

Fix: e2e_latency = stage_latency * num_layers * pp_size (num_layers is the per-PP-stage layer count, same convention as line 869).

Verification
Offline targeted harness (bypasses bench server via __new__, monkeypatched report dir), plus py_compile:

- run_mode=decode, num_layers=64 (mirror layers no longer wrongly subtracted)
- num_layers=62 (mirror subtraction logic unchanged)
- stage=100.0, e2e=2000.0 = 100us x 10 layers x pp2
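The e2e arithmetic in the last verification entry can be checked by hand. The following is a small sketch of the two formulas as described above, using the same numbers (100us stage latency, 10 layers per PP stage, pp_size 2); the function names `e2e_latency_old` and `e2e_latency_new` are hypothetical and do not appear in engine.py.

```python
def e2e_latency_old(stage_latency: float, pp_size: int) -> float:
    # Formula before the fix: the per-stage layer count is missing.
    return stage_latency * pp_size


def e2e_latency_new(stage_latency: float, num_layers: int, pp_size: int) -> float:
    # Formula after the fix: num_layers is the per-PP-stage layer count.
    return stage_latency * num_layers * pp_size


stage_latency = 100.0  # one-layer DAG makespan, in microseconds
num_layers = 10        # layers per pipeline stage
pp_size = 2            # pipeline-parallel stages

old = e2e_latency_old(stage_latency, pp_size)
new = e2e_latency_new(stage_latency, num_layers, pp_size)
print(old, new, new / old)
```

The ratio between the two results equals num_layers, which is the size of the under-report described in item 2.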