
OOM with TensorRT LLM #4

@yuraMan07

Description


I have a GPU server with the following specs:
8× NVIDIA B200, NVIDIA-SMI 590.48.01, Driver Version: 590.48.01, CUDA Version: 13.1

I am trying to run gigachat3.1-ultra:

=========================== config.yml ===========================

```yaml
cuda_graph_config:
  enable_padding: true
  max_batch_size: 256

enable_attention_dp: true
attention_dp_config:
  batching_wait_iters: 0
  enable_balance: true
  timeout_iters: 60

print_iter_log: false

kv_cache_config:
  enable_block_reuse: false
  dtype: fp8
  free_gpu_memory_fraction: 0.8

stream_interval: 10

moe_config:
  backend: DEEPGEMM

speculative_config:
  decoding_type: MTP
  num_nextn_predict_layers: 1
```

=========================== DOCKER COMPOSE ==============================

```yaml
services:
  trtllm-serve:
    image: nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc10
    container_name: trtllm-server
    runtime: nvidia
    shm_size: "64gb" # Recommended for larger models
    ulimits:
      memlock: -1
      stack: 67108864
    ports:
      - "8000:8000"
    volumes:
      - /GigaChat3.1-702B-A36B:/gigachat3.1-ultra
      - /config.yml:/config.yml
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - NVIDIA_DRIVER_CAPABILITIES=compute,utility
      - PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
      - PYTORCH_ALLOC_CONF=expandable_segments:True
    command:
      - trtllm-serve
      - serve
      - "/gigachat3.1-ultra"
      - --backend=pytorch
      - --host=0.0.0.0
      - --port=8000
      - --tp_size=8
      - --ep_size=8
      - --pp_size=1
      - --max_batch_size=128
      - --max_num_tokens=8192
      - --max_seq_len=8192
      - --trust_remote_code
      - --extra_llm_api_options=/config.yml
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```
After running 20 parallel POST /v1/chat/completions requests with random questions, I got an out-of-memory (OOM) error. Is GigaChat really using all of my memory, or do I need to change something in the settings?
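For context, whether 20 concurrent requests can plausibly exhaust the KV-cache pool can be estimated with back-of-the-envelope arithmetic. Below is a minimal sketch; the model shape numbers (`num_layers`, `num_kv_heads`, `head_dim`) are hypothetical placeholders, not the real GigaChat3.1 config, and the sketch ignores sharding across GPUs and any MLA-style cache compression:

```python
# Back-of-the-envelope KV-cache sizing to sanity-check an OOM.
# NOTE: the model shapes below are PLACEHOLDERS -- substitute the real values
# from the model's config.json before drawing conclusions.

def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_elem: int) -> int:
    # Factor of 2 covers the separate K and V tensors;
    # an fp8 cache (dtype: fp8) uses 1 byte per element.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

# Placeholder shapes (NOT the real GigaChat3.1 values):
per_token = kv_cache_bytes_per_token(num_layers=80, num_kv_heads=8,
                                     head_dim=128, bytes_per_elem=1)

# 20 concurrent requests, each allowed to grow to max_seq_len=8192:
concurrent_requests = 20
max_seq_len = 8192
total_gib = concurrent_requests * max_seq_len * per_token / 2**30
print(f"{per_token} B/token, ~{total_gib:.1f} GiB KV cache at full length")
```

If the number that comes out is small relative to the 80% of GPU memory reserved by `free_gpu_memory_fraction: 0.8`, the OOM is more likely coming from weights plus activation/workspace memory than from the KV cache itself.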
