GenAI Performance Analyzer

A streamlined tool for analyzing and visualizing performance metrics of Large Language Models (LLMs) across different configurations and runtime environments. Compatible with performance data generated by NVIDIA's GenAI-Perf tool.

Features

  • Interactive Visualization: Dynamic plots with:
    • Latency distributions shown as box plots with throughput on x-axis
    • Performance metrics plotted against request throughput
    • Statistical indicators (mean, quartiles, P90, P99)
  • Configuration Comparison: Compare different model configurations side by side with:
    • Overlaid box plots for latency distributions
    • Trend lines showing metric variations with throughput
    • Color-coded model configurations for easy differentiation
  • Model Configuration Display:
    • Compact model selection format: model_name (CLOUD | INSTANCE | GPU | ENGINE | GPU_CONFIG | PARALLEL)
    • Detailed configuration panel showing:
      • Hardware details (Cloud, Instance, GPU)
      • Software configuration (Engine, GPU Config, Parallelism)
      • Optimization strategy
  • Metric Analysis: Analyze various performance metrics including:
    • Request Latency
    • Time to First Token
    • Inter-token Latency
    • Request Throughput
    • Output Token Throughput
    • Output Token Throughput per Request
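The statistical indicators listed above (mean, quartiles, P90, P99) can be computed directly from raw latency samples. A minimal sketch with NumPy (the function name is illustrative, not part of the tool's API):

```python
import numpy as np

def latency_stats(samples_ms):
    """Summarize request latencies (ms) with the indicators
    shown in the box plots: mean, quartiles, P90, P99."""
    a = np.asarray(samples_ms, dtype=float)
    return {
        "avg": float(a.mean()),
        "p25": float(np.percentile(a, 25)),
        "p50": float(np.percentile(a, 50)),
        "p75": float(np.percentile(a, 75)),
        "p90": float(np.percentile(a, 90)),
        "p99": float(np.percentile(a, 99)),
    }
```

In practice GenAI-Perf already emits these aggregates per metric, so the analyzer reads them from JSON rather than recomputing them.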

Architecture

The application follows a modular architecture with clear separation of concerns. View the detailed Architecture Documentation for:

  • System Components Overview
  • Data Flow Diagram
  • Component Descriptions
  • Technology Stack Details

Interface Preview

Latency Distribution View

Analyze latency distributions across different token configurations with interactive box plots and statistical insights.

Model Comparison View

Compare performance metrics between different model configurations and concurrency levels.

Overall Interface

Complete interface with sidebar controls and metric visualization panels.

Installation

  1. Clone the repository:

```bash
git clone https://github.com/yourusername/genai-perf-analyzer.git
cd genai-perf-analyzer
```

  2. Create and activate a virtual environment:

```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

  3. Install dependencies:

```bash
pip install -r requirements.txt
```

Usage

  1. Place your performance data in the `data/` directory following the expected format:

```
data/
├── meta_llama-3.1-8b_aws_p5.48xlarge_h100_tensorrt_llm-h100-fp8-tp1-pp1-throughput/
│   ├── meta_llama-3.1-8b-instruct-openai-chat-concurrency1/
│   │   ├── 200_200_genai_perf.json    # Input 200, Output 200 tokens
│   │   ├── 200_5_genai_perf.json      # Input 200, Output 5 tokens
│   │   └── 1000_200_genai_perf.json   # Input 1000, Output 200 tokens
│   └── meta_llama-3.1-8b-instruct-openai-chat-concurrency2/
│       └── ...
└── meta_llama-3.1-8b_aws_g5.12xlarge_a10g_tensorrt_llm-a10g-bf16-tp2-latency/
    └── meta_llama-3.1-8b-instruct-openai-chat-concurrency1/
        └── ...
```

Directory naming convention:

  • Top level: {model_name}_{cloud_provider}_{instance_type}_{gpu_type}_{model_profile} where model_profile contains:
    • Engine type (e.g., tensorrt_llm, vllm)
    • GPU name (e.g., h100, a10g)
    • Precision (e.g., fp8, bf16)
    • Tensor parallelism (e.g., tp1, tp2)
    • Pipeline parallelism (e.g., pp1)
    • Optimization target (throughput/latency)
  • Test runs: {model_name}-{api_type}-concurrency{N}
  • Results: {input_tokens}_{output_tokens}_genai_perf.json
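Note that underscores appear both inside model names (`meta_llama-3.1-8b`) and inside engine names (`tensorrt_llm`), so a naive split on `_` is ambiguous. One way to resolve this is to anchor on the cloud-provider token; a sketch, assuming a fixed set of known providers (`parse_run_dir` and `CLOUDS` are illustrative names, not the tool's actual code):

```python
CLOUDS = {"aws", "gcp", "azure"}  # assumed set of recognized providers

def parse_run_dir(name):
    """Split a top-level run directory name into its fields by
    locating the cloud-provider token, since model and engine
    names may themselves contain underscores."""
    parts = name.split("_")
    for i, part in enumerate(parts):
        if part in CLOUDS:
            return {
                "model": "_".join(parts[:i]),
                "cloud": part,
                "instance": parts[i + 1],
                "gpu": parts[i + 2],
                "profile": "_".join(parts[i + 3:]),
            }
    raise ValueError(f"unrecognized directory name: {name}")
```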

The model profile information is displayed in a consistent format throughout the application:

  • Selection dropdown: meta_llama-3.1-8b (AWS | p5.48xlarge | H100 | TRT | H100-FP8 | TP1-PP1)
  • Detailed view shows:
    • Hardware: Cloud provider, Instance type, GPU model
    • Software: Engine type, GPU configuration, Parallelism strategy
    • Additional: Optimization target
  2. Run the Streamlit application:

```bash
streamlit run app/main.py
```

  3. Access the web interface at http://localhost:8501
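The compact dropdown label described above can be rendered from parsed configuration fields; a sketch (the function name and parameter names are illustrative, not the tool's actual code):

```python
def dropdown_label(model, cloud, instance, gpu, engine, gpu_config, parallel):
    """Render the compact selection format:
    model_name (CLOUD | INSTANCE | GPU | ENGINE | GPU_CONFIG | PARALLEL)."""
    return (
        f"{model} ({cloud.upper()} | {instance} | {gpu.upper()} | "
        f"{engine} | {gpu_config} | {parallel})"
    )
```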

Data Format

The analyzer expects performance data in JSON format compatible with NVIDIA's GenAI-Perf output structure:

```json
{
  "request_latency": {
    "unit": "ms",
    "avg": 100.0,
    "p50": 95.0,
    "p90": 150.0,
    "p95": 180.0,
    "p99": 200.0,
    "min": 50.0,
    "max": 250.0,
    "std": 30.0
  }
}
```

Other metrics follow the same per-metric structure.

For information about generating performance data, refer to the GenAI-Perf documentation.
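A file in this format can be loaded with the standard library alone; a minimal sketch that keeps only entries carrying the statistics shown above (`load_metrics` and `EXPECTED_STATS` are illustrative names, not the analyzer's actual API):

```python
import json
from pathlib import Path

EXPECTED_STATS = {"avg", "p50", "p90", "p95", "p99", "min", "max", "std"}

def load_metrics(path):
    """Load a *_genai_perf.json file and return only the metric
    entries that contain the expected statistical fields."""
    data = json.loads(Path(path).read_text())
    return {
        name: entry
        for name, entry in data.items()
        if isinstance(entry, dict) and EXPECTED_STATS <= entry.keys()
    }
```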

Contributing

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.
