A production-oriented kernel / operator tracing framework for large language models. It supports HuggingFace Transformers and vLLM, runs tracers in isolated subprocesses, and is designed for batch model evaluation at scale.
## Features

- Multiple tracing backends
  - torch.profiler
  - torch dispatch / dynamo / FX / Inductor
  - vLLM built-in profiler
- Unified tracer abstraction (`Tracer` class + registry)
- Subprocess isolation (OOM-safe, crash-safe)
- Concurrent, de-duplicated model downloads
- Model skip-list support
- Intelligent bypass when a trace already exists
- Built-in support for HF mirrors & offline mode
- Works with both HF eager execution and vLLM engines
## Installation

```bash
pip install -e .
```

With vLLM support:

```bash
pip install -e .[vllm]
```

Requirements:

- Python ≥ 3.10
- PyTorch ≥ 2.2
- transformers ≥ 4.40.0
- vllm ≥ 0.11.0 (optional, only needed when tracing vLLM)
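A quick pre-flight check can confirm the environment meets these minimums before launching a long batch run. This is a minimal sketch using only the standard library; the version floors come from the list above, and the helper names are illustrative, not part of the project's API.

```python
from importlib.metadata import PackageNotFoundError, version

# Minimum versions from the requirements above (vllm is optional, so omitted).
MIN_VERSIONS = {"torch": (2, 2), "transformers": (4, 40, 0)}

def parse_version(v: str) -> tuple:
    """Parse a dotted version string into a comparable int tuple.

    Local suffixes such as '+cu121' are ignored: only leading digits
    of each dot-separated piece are kept.
    """
    parts = []
    for piece in v.split("."):
        digits = ""
        for ch in piece:
            if ch.isdigit():
                digits += ch
            else:
                break
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)

def check_requirements() -> list[str]:
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    for pkg, minimum in MIN_VERSIONS.items():
        try:
            installed = parse_version(version(pkg))
        except PackageNotFoundError:
            problems.append(f"{pkg} is not installed")
            continue
        if installed < minimum:
            floor = ".".join(map(str, minimum))
            problems.append(f"{pkg} {version(pkg)} is older than required {floor}")
    return problems
```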
## Usage

List available tracers:

```bash
python main.py --list-tracers
```

Trace a single model with all applicable tracers:

```bash
# HuggingFace
python main.py \
    --framework huggingface \
    --model meta-llama/Llama-2-7b-hf \
    --output-dir outputs/llama2

# vLLM
python main.py \
    --framework vllm \
    --model meta-llama/Llama-2-7b-hf \
    --output-dir outputs/llama2
```

Run a specific tracer:

```bash
# HuggingFace
python main.py \
    --framework huggingface \
    --model meta-llama/Llama-2-7b-hf \
    --tracers torch_profiler \
    --output-dir outputs/llama2

# vLLM
python main.py \
    --framework vllm \
    --model meta-llama/Llama-2-7b-hf \
    --tracers vllm_profiler \
    --output-dir outputs/llama2
```

Select the execution mode:

```bash
# HuggingFace, training mode
python main.py \
    --framework huggingface \
    --model meta-llama/Llama-2-7b-hf \
    --mode train \
    --output-dir outputs/llama2

# vLLM, eval mode
python main.py \
    --framework vllm \
    --model meta-llama/Llama-2-7b-hf \
    --mode eval \
    --output-dir outputs/llama2
```

Invoke a worker directly, bypassing the orchestrator:

```bash
# HuggingFace
python core/worker.py \
    --framework huggingface \
    --model-path /path/to/model \
    --tracer torch_profiler \
    --mode eval \
    --output-dir /path/to/output

# vLLM
python core/worker.py \
    --framework vllm \
    --model-path /path/to/model \
    --tracer vllm_profiler \
    --mode eval \
    --output-dir /path/to/output
```

## Batch tracing

Trace a list of models, skipping any that appear in a skip list:

```bash
# HuggingFace
python main.py \
    --framework huggingface \
    --model-list config/models.txt \
    --skip-model-list config/skip.txt \
    --output-dir outputs

# vLLM
python main.py \
    --framework vllm \
    --model-list config/models.txt \
    --skip-model-list config/skip.txt \
    --output-dir outputs
```

Batch tracing with a specific tracer:

```bash
# HuggingFace
python main.py \
    --framework huggingface \
    --model-list config/models.txt \
    --tracers torch_profiler \
    --skip-model-list config/skip.txt \
    --output-dir outputs

# vLLM
python main.py \
    --framework vllm \
    --model-list config/models.txt \
    --tracers vllm_profiler \
    --skip-model-list config/skip.txt \
    --output-dir outputs
```

Batch tracing with an explicit mode:

```bash
# HuggingFace, training mode
python main.py \
    --framework huggingface \
    --model-list config/models.txt \
    --mode train \
    --skip-model-list config/skip.txt \
    --output-dir outputs

# vLLM, eval mode
python main.py \
    --framework vllm \
    --model-list config/models.txt \
    --mode eval \
    --skip-model-list config/skip.txt \
    --output-dir outputs
```

## Forcing a re-trace

`--force` re-runs tracing even when a completed trace already exists:

```bash
# HuggingFace
python main.py \
    --framework huggingface \
    --model meta-llama/Llama-2-7b-hf \
    --force

# vLLM
python main.py \
    --framework vllm \
    --model meta-llama/Llama-2-7b-hf \
    --force
```

Force with a specific tracer:

```bash
# HuggingFace
python main.py \
    --framework huggingface \
    --tracers torch_profiler \
    --model meta-llama/Llama-2-7b-hf \
    --force

# vLLM
python main.py \
    --framework vllm \
    --tracers vllm_profiler \
    --model meta-llama/Llama-2-7b-hf \
    --force
```

Force with an explicit mode:

```bash
# HuggingFace, training mode
python main.py \
    --framework huggingface \
    --mode train \
    --model meta-llama/Llama-2-7b-hf \
    --force

# vLLM, eval mode
python main.py \
    --framework vllm \
    --mode eval \
    --model meta-llama/Llama-2-7b-hf \
    --force
```

## Why subprocess isolation?

- torch.profiler and dynamo are stateful
- vLLM may hang or OOM
- Subprocess isolation guarantees:
  - a clean CUDA state per run
  - safe timeouts
  - tolerance of partial failures
## Execution models

### HuggingFace

- Execution unit: `nn.Module`
- Input: tokenized tensors
- Tracing scope:
  - aten ops
  - FX graphs
  - torch.profiler events
  - torch.inductor IR
  - torch.dispatch
- Failure mode:
  - missing shards → exception (fast fail)
### vLLM

- Execution unit: `LLMEngine`
- Input: prompt strings
- Tracing scope:
  - CUDA kernels inside the engine
  - attention / KV-cache behavior
- Failure mode:
  - partial cache → hang or silent stall

Because a partially downloaded cache can stall vLLM silently, this project enforces strict model cache validation before tracing.
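Such a validation pass can be sketched as follows. The function name `cache_is_complete` is hypothetical, but the `model.safetensors.index.json` / `weight_map` layout it checks is the standard HuggingFace sharded-checkpoint format:

```python
import json
from pathlib import Path

def cache_is_complete(model_dir: str) -> bool:
    """Pre-flight check: every shard listed in the safetensors index
    must exist on disk before the engine is allowed to load the model."""
    root = Path(model_dir)
    index = root / "model.safetensors.index.json"
    if not index.exists():
        # Single-file checkpoint: accept if the lone weights file is present.
        return (root / "model.safetensors").exists()
    # The index maps tensor names to shard filenames; collect the unique shards.
    weight_map = json.loads(index.read_text())["weight_map"]
    shards = set(weight_map.values())
    return all((root / shard).exists() for shard in shards)
```

Failing this check up front turns vLLM's hang-or-stall failure mode into a fast, explicit skip.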
## Output layout

```
outputs/
  llama2/
    torch_profiler/
      torch_profile.json
      torch_kernel_profile.json
      ...
    print_fx/
      print_fx.json
```

Each tracer owns one output directory, and success is marked by a `_TRACE_STATUS.json` file in that directory, which is used for bypass and resumability.
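The bypass logic can be sketched like this. Only the `_TRACE_STATUS.json` file name comes from above; the JSON fields and helper names are assumptions for illustration:

```python
import json
import time
from pathlib import Path

STATUS_FILE = "_TRACE_STATUS.json"

def mark_success(tracer_dir: str) -> None:
    """Write the success marker after a tracer finishes cleanly."""
    path = Path(tracer_dir) / STATUS_FILE
    path.write_text(json.dumps({"status": "success", "finished_at": time.time()}))

def should_bypass(tracer_dir: str) -> bool:
    """Skip a tracer whose previous run already succeeded (resumability)."""
    path = Path(tracer_dir) / STATUS_FILE
    if not path.exists():
        return False
    try:
        return json.loads(path.read_text()).get("status") == "success"
    except (json.JSONDecodeError, OSError):
        # A corrupt or unreadable marker means the run did not finish cleanly.
        return False
```

Because the marker is written only after the tracer completes, a crash or timeout leaves no marker behind and the tracer is retried on the next run; `--force` would simply ignore the marker.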
## FAQ

**Q: Why does each tracer run in its own subprocess?**

A: Torch dynamo and the profiler conflict with each other and leak global state.

**Q: Is it safe to trace many models in batch?**

A: Yes. Download failures are detected early and the affected models are skipped safely.

**Q: Why is the model cache validated before tracing?**

A: vLLM may hang if the model type or weight shards are incomplete.

**Q: How do I add a custom tracer?**

A: Subclass `Tracer` and register it in `tracers/registry.py`:

```python
class MyTracer(Tracer):
    name = "my_tracer"
    exclusive_group = None

    def run(self, ctx, output_dir):
        ...  # write trace artifacts into output_dir
```
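The registry is most likely just a name-to-class mapping plus a registration hook. A sketch of how that shape could look; the `TRACER_REGISTRY` dict and `register` decorator are assumptions about `tracers/registry.py`, not the project's confirmed API (and the real `MyTracer` would subclass `Tracer`):

```python
# Hypothetical shape of tracers/registry.py: map tracer names to classes.
TRACER_REGISTRY: dict[str, type] = {}

def register(cls: type) -> type:
    """Decorator: add a tracer class to the registry under its declared name."""
    TRACER_REGISTRY[cls.name] = cls
    return cls

@register
class MyTracer:
    name = "my_tracer"
    exclusive_group = None  # tracers sharing a group would never run together

    def run(self, ctx, output_dir):
        ...  # write trace artifacts into output_dir
```

The orchestrator can then resolve `--tracers my_tracer` with a single dictionary lookup.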
## License

MIT