This project provides tools to estimate and benchmark the theoretical inference performance of large language models (LLMs) on high-end NVIDIA GPUs. It consists of two scripts:
llm_perf_estimator.py
: Estimates per-query and per-token metrics for a single LLM configuration.grid_search_runner.py
: Automates grid search across multiple GPUs and model sizes, producing comprehensive Markdown reports.
pip install pandas
Estimate performance for a single model/GPU configuration:
python llm_perf_estimator.py \
--model_size_b 70 \
--precision FP8 \
--gpu_type H200 \
--num_gpus 1 \
--input_tokens 512 \
--output_tokens 512 \
--batch_size 8 \
--cache_ratio 0.05 \
--mfu 0.15
This prints:
- Prefill throughput (tokens/sec, compute-bound with FLOPs utilization)
- Decode throughput (tokens/sec, memory-bound)
- Prefill & Decode times (sec)
- Latency per query (sec)
Perform a grid search over GPU types and model sizes, saving results to Markdown:
python grid_search_runner.py \
--precision FP8 \
--input_tokens 512 \
--output_tokens 512 \
--batch_size 1 \
--cache_ratio 0.05 \
--mfu 0.15 \
--num_gpus 1 \
--output llm_perf_grid.md
The generated llm_perf_grid.md
will include:
-
The command used (in a code block)
-
Four tables for:
- Prefill Throughput
- Decode Throughput
- Aggregate Decode Throughput
- Latency per Query
The script models two distinct phases of autoregressive LLM inference:
- Prefill (prompt encoding): Processes the entire input prompt in one batched forward pass. This phase is compute-bound, since the GPU performs large matrix‐matrix operations (high arithmetic intensity).
- Decode (token generation): Generates new tokens one at a time (or in micro‐batches). This phase is memory-bound, driven by streaming weights and accessing the growing KV‐cache from GPU memory.
- Prefill throughput
Tokens/sec during prompt encoding. - Decode throughput
Tokens/sec during generation. - Latency per query
Seconds to process one query (input + output).
Argument | Description |
---|---|
--model_size_b |
Model size in billions of parameters (e.g., 70 for 70B). |
--precision |
Numeric precision: FP16 , FP8 , or FP4 . |
--gpu_type |
GPU hardware: H100 , H200 , or B200 . |
--num_gpus |
Number of GPUs to scale bandwidth & compute. |
--input_tokens |
Number of prompt tokens per query. |
--output_tokens |
Number of tokens to generate per query. |
--batch_size |
Concurrent sequences in one batch. |
--cache_ratio |
KV-cache total size as a fraction of model size (e.g., 0.05 ). |
--mfu |
FLOPs utilization fraction for prefill (e.g., 0.15 for 15%). |
FLOPS_PER_PARAM = 6.0
Approximate FLOPs per parameter per token for a forward pass.GPU_BANDWIDTH
H100: 3.35 TB/s, H200: 4.8 TB/s, B200: 8 TB/s.GPU_PEAK_FLOPS
H100: ~2 PFLOPS (FP16), H200: ~3 PFLOPS, B200: ~5 PFLOPS.PRECISION_BYTES
FP16: 2 B/param, FP8: 1 B/param, FP4: 0.5 B/param.
python grid_search_runner.py --precision FP8 --input_tokens 512 --output_tokens 512 --batch_size 16 --cache_ratio 0.05 --mfu 0.15 --num_gpus 1 --output llm_perf_grid_bs16_1x.md
GPU | 8B | 13B | 32B | 70B | 405B | 671B |
---|---|---|---|---|---|---|
H100 | 50000.0 | 30769.23 | 12500.0 | 5714.29 | 987.65 | 596.13 |
H200 | 75000.0 | 46153.85 | 18750.0 | 8571.43 | 1481.48 | 894.19 |
B200 | 125000.0 | 76923.08 | 31250.0 | 14285.71 | 2469.14 | 1490.31 |
GPU | 8B | 13B | 32B | 70B | 405B | 671B |
---|---|---|---|---|---|---|
H100 | 29777.78 | 18324.79 | 7444.44 | 3403.17 | 588.2 | 355.03 |
H200 | 42666.67 | 26256.41 | 10666.67 | 4876.19 | 842.8 | 508.69 |
B200 | 71111.11 | 43760.68 | 17777.78 | 8126.98 | 1404.66 | 847.82 |
GPU | 8B | 13B | 32B | 70B | 405B | 671B |
---|---|---|---|---|---|---|
H100 | 0.439 | 0.713 | 1.756 | 3.841 | 22.222 | 36.816 |
H200 | 0.301 | 0.489 | 1.205 | 2.636 | 15.25 | 25.265 |
B200 | 0.181 | 0.294 | 0.723 | 1.581 | 9.15 | 15.159 |
python grid_search_runner.py --precision FP8 --input_tokens 512 --output_tokens 512 --batch_size 16 --cache_ratio 0.05 --mfu 0.15 --num_gpus 8 --output llm_perf_grid_bs16_8x.md
GPU | 8B | 13B | 32B | 70B | 405B | 671B |
---|---|---|---|---|---|---|
H100 | 50000.0 | 30769.23 | 12500.0 | 5714.29 | 987.65 | 596.13 |
H200 | 75000.0 | 46153.85 | 18750.0 | 8571.43 | 1481.48 | 894.19 |
B200 | 125000.0 | 76923.08 | 31250.0 | 14285.71 | 2469.14 | 1490.31 |
GPU | 8B | 13B | 32B | 70B | 405B | 671B |
---|---|---|---|---|---|---|
H100 | 29777.78 | 18324.79 | 7444.44 | 3403.17 | 588.2 | 355.03 |
H200 | 42666.67 | 26256.41 | 10666.67 | 4876.19 | 842.8 | 508.69 |
B200 | 71111.11 | 43760.68 | 17777.78 | 8126.98 | 1404.66 | 847.82 |
GPU | 8B | 13B | 32B | 70B | 405B | 671B |
---|---|---|---|---|---|---|
H100 | 0.439 | 0.713 | 1.756 | 3.841 | 22.222 | 36.816 |
H200 | 0.301 | 0.489 | 1.205 | 2.636 | 15.25 | 25.265 |
B200 | 0.181 | 0.294 | 0.723 | 1.581 | 9.15 | 15.159 |