
mock: cache-aware prefill/decode latency so disaggregated TTFT/TPS tests are measurable #4246

Draft
ddjukicTT wants to merge 11 commits into main from ddjukic/prefill-functional-requirements-test
ddjukicTT (Collaborator) commented Jun 16, 2026

Make the prefill/decode disaggregation smoke suite runnable end-to-end from the top-level run.py, against a self-contained mock stack that's faithful enough to measure prefix-cache behavior.

Commands:

python run.py --workflow prefill_decode --served-model moonshotai/Kimi-K2.6      # non-catalog
python run.py --model DeepSeek-R1-0528 --workflow prefill_decode --device galaxy  # catalog

test_07 fails on the mock simulator: the prefix cache behaves correctly, but the test's TTFT-ratio assertion trips because simulator TTFT (~40 ms) is dominated by prompt transport/tokenization, not on-device prefill. This is a simulator artifact, not a regression; the thresholds are tunable via TTFT_MEANINGFUL_S / TTFT_HIT_MAX_FRACTION.
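For readers unfamiliar with this kind of check, here is a minimal sketch of how a TTFT-ratio assertion with a "meaningful" floor could work. It is not the code in test_07. The function name `classify_ttft_pair`, the default values, and the exact semantics of the two knobs are assumptions for illustration: TTFT_MEANINGFUL_S is read here as a floor on the cold-cache TTFT below which the ratio says nothing, and TTFT_HIT_MAX_FRACTION as the largest allowed warm/cold TTFT ratio.

```python
def classify_ttft_pair(
    ttft_miss_s: float,
    ttft_hit_s: float,
    meaningful_s: float = 0.5,       # hypothetical default for TTFT_MEANINGFUL_S
    hit_max_fraction: float = 0.5,   # hypothetical default for TTFT_HIT_MAX_FRACTION
) -> str:
    """Classify a (cold, warm) TTFT pair as 'pass', 'fail', or 'inconclusive'.

    Assumed semantics, not confirmed by this PR:
    - if the cold-cache TTFT is below `meaningful_s`, prefill is too cheap
      for the ratio to reflect cache behavior, so the result is inconclusive;
    - otherwise the warm-cache TTFT must be at most
      `hit_max_fraction` times the cold-cache TTFT.
    """
    if ttft_miss_s < meaningful_s:
        return "inconclusive"
    if ttft_hit_s <= hit_max_fraction * ttft_miss_s:
        return "pass"
    return "fail"


# A ~40 ms simulator TTFT sits below the assumed floor, so the ratio is not informative.
print(classify_ttft_pair(0.040, 0.038))
# With a prefill-dominated TTFT, a cache hit should cut TTFT substantially.
print(classify_ttft_pair(2.0, 0.4))
```

Under this reading, lowering the floor makes the check apply to fast simulators, while raising the fraction loosens how much speedup a cache hit must show.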

Comment thread tt-media-server/cpp_server/src/runtime/runners/llm_runner.cpp
Comment thread tt-media-server/cpp_server/src/runtime/runners/llm_runner.cpp Outdated
echo "DYNAMO_ENDPOINT_NAME=generate"
}

start_frontend() {


Can we run deploy.sh instead of run_stack.sh?

// active sequence and takes MOCK_DECODE_SLEEP_US, modeling inter-token latency
// so the decode tokens-per-second (TPS ≈ 1e6 / MOCK_DECODE_SLEEP_US) is
// measurable on the mock. Default 0 (tokens emitted as fast as the loop runs).
std::chrono::microseconds mockDecodeDelay() {
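The relationship stated in the comment above (TPS ≈ 1e6 / MOCK_DECODE_SLEEP_US) can be sanity-checked with a few lines of Python. This is only a worked example of the arithmetic, assuming a single active sequence and ignoring loop overhead; `expected_decode_tps` is a hypothetical helper, not part of the PR.

```python
import math


def expected_decode_tps(decode_sleep_us: int) -> float:
    """Approximate per-sequence decode TPS implied by a fixed inter-token sleep.

    Assumes the sleep dominates each decode step. A value of 0 means no
    artificial delay, so the rate is bounded only by how fast the loop runs;
    that case is reported as infinity here.
    """
    if decode_sleep_us <= 0:
        return math.inf
    return 1_000_000 / decode_sleep_us


# 20,000 us between tokens corresponds to roughly 50 tokens per second.
print(expected_decode_tps(20_000))
```

A test that asserts on measured TPS would want some tolerance around this value, since real scheduling and transport add latency on top of the sleep.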


In general, we should consider switching to docker-compose to deploy multiple related Docker containers, instead of using multiple bash scripts.
