I have a GPU server with the following specs:
8× NVIDIA B200, NVIDIA-SMI 590.48.01, Driver Version: 590.48.01, CUDA Version: 13.1

I'm trying to run gigachat3.1-ultra with this setup:
```yaml
# config.yml
cuda_graph_config:
  enable_padding: true
  max_batch_size: 256
enable_attention_dp: true
attention_dp_config:
  batching_wait_iters: 0
  enable_balance: true
  timeout_iters: 60
print_iter_log: false
kv_cache_config:
  enable_block_reuse: false
  dtype: fp8
  free_gpu_memory_fraction: 0.8
stream_interval: 10
moe_config:
  backend: DEEPGEMM
speculative_config:
  decoding_type: MTP
  num_nextn_predict_layers: 1
```

```yaml
# docker-compose.yml
services:
  trtllm-serve:
    image: nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc10
    container_name: trtllm-server
    runtime: nvidia
    shm_size: "64gb"  # recommended for larger models
    ulimits:
      memlock: -1
      stack: 67108864
    ports:
      - "8000:8000"
    volumes:
      - /GigaChat3.1-702B-A36B:/gigachat3.1-ultra
      - /config.yml:/config.yml
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - NVIDIA_DRIVER_CAPABILITIES=compute,utility
      - PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
      - PYTORCH_ALLOC_CONF=expandable_segments:True
    command:
      - trtllm-serve
      - serve
      - "/gigachat3.1-ultra"
      - --backend=pytorch
      - --host=0.0.0.0
      - --port=8000
      - --tp_size=8
      - --ep_size=8
      - --pp_size=1
      - --max_batch_size=128
      - --max_num_tokens=8192
      - --max_seq_len=8192
      - --trust_remote_code
      - --extra_llm_api_options=/config.yml
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```
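For context, here is my back-of-envelope estimate of where the memory should go (assumptions, not measurements: FP8 weights sharded evenly across the 8 GPUs, and 180 GB of HBM per B200):

```python
# Rough per-GPU memory budget for GigaChat3.1-702B-A36B under TP=8.
params = 702e9          # total parameters
bytes_per_param = 1     # FP8 weights (assumption)
num_gpus = 8
hbm_per_gpu_gb = 180    # B200 HBM size (assumption; adjust for your SKU)

weights_per_gpu_gb = params * bytes_per_param / num_gpus / 1e9
print(f"weights per GPU ≈ {weights_per_gpu_gb:.1f} GB")   # ≈ 87.8 GB

# free_gpu_memory_fraction: 0.8 lets TRT-LLM claim 80% of whatever is
# free after weight loading for the KV cache.
kv_cache_budget_gb = 0.8 * (hbm_per_gpu_gb - weights_per_gpu_gb)
print(f"KV-cache budget per GPU ≈ {kv_cache_budget_gb:.1f} GB")  # ≈ 73.8 GB
```

This ignores activations, CUDA graph capture, and the MTP draft layer, so the real headroom is smaller than the numbers above suggest.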
After sending 20 parallel POST /v1/chat/completions requests with random questions, I hit an out-of-memory (OOM) error. Is GigaChat really consuming all of my GPU memory, or do I need to change something in the settings?
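For reproduction, the load was generated with a client roughly like this (a sketch of the pattern, not my exact script; the question list and max_tokens are placeholders):

```python
# Fire 20 parallel chat-completion requests at the trtllm-serve endpoint
# configured above. Stdlib only.
import json
import random
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8000/v1/chat/completions"
QUESTIONS = ["What is MoE?", "Explain KV cache.", "Summarize CUDA graphs."]

def build_payload(question: str) -> bytes:
    """JSON body for one chat-completion request."""
    return json.dumps({
        "model": "/gigachat3.1-ultra",
        "messages": [{"role": "user", "content": question}],
        "max_tokens": 512,
    }).encode()

def send_one(_: int) -> int:
    """POST a single random question and return the HTTP status."""
    req = urllib.request.Request(
        URL,
        data=build_payload(random.choice(QUESTIONS)),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=20) as pool:
        print(list(pool.map(send_one, range(20))))
```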