I have a GPU server with the following specs:
8× NVIDIA B200, NVIDIA-SMI 590.48.01, Driver Version: 590.48.01, CUDA Version: 13.1

I'm trying to run gigachat3.1-ultra with this setup:
```yaml
# config.yml
cuda_graph_config:
  enable_padding: true
  max_batch_size: 256
enable_attention_dp: true
attention_dp_config:
  batching_wait_iters: 0
  enable_balance: true
  timeout_iters: 60
print_iter_log: false
kv_cache_config:
  enable_block_reuse: false
  dtype: fp8
  free_gpu_memory_fraction: 0.8
stream_interval: 10
moe_config:
  backend: DEEPGEMM
speculative_config:
  decoding_type: MTP
  num_nextn_predict_layers: 1
```

```yaml
# docker-compose.yml
services:
  trtllm-serve:
    image: nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc10
    container_name: trtllm-server
    runtime: nvidia
    shm_size: "64gb"  # recommended for larger models
    ulimits:
      memlock: -1
      stack: 67108864
    ports:
      - "8000:8000"
    volumes:
      - /GigaChat3.1-702B-A36B:/gigachat3.1-ultra
      - /config.yml:/config.yml
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - NVIDIA_DRIVER_CAPABILITIES=compute,utility
      - PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
      - PYTORCH_ALLOC_CONF=expandable_segments:True
    command:
      - trtllm-serve
      - serve
      - "/gigachat3.1-ultra"
      - --backend=pytorch
      - --host=0.0.0.0
      - --port=8000
      - --tp_size=8
      - --ep_size=8
      - --pp_size=1
      - --max_batch_size=128
      - --max_num_tokens=8192
      - --max_seq_len=8192
      - --trust_remote_code
      - --extra_llm_api_options=/config.yml
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```
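For context, here is my back-of-envelope estimate of where the memory should go (assumptions, not measurements: FP8 weights sharded evenly across the 8 GPUs, and 180 GB of HBM per B200):

```python
# Rough per-GPU memory budget for GigaChat3.1-702B-A36B under TP=8.
params = 702e9          # total parameters
bytes_per_param = 1     # FP8 weights (assumption)
num_gpus = 8
hbm_per_gpu_gb = 180    # B200 HBM size (assumption; adjust for your SKU)

weights_per_gpu_gb = params * bytes_per_param / num_gpus / 1e9
print(f"weights per GPU ≈ {weights_per_gpu_gb:.1f} GB")   # ≈ 87.8 GB

# free_gpu_memory_fraction: 0.8 lets TRT-LLM claim 80% of whatever is
# free after weight loading for the KV cache.
kv_cache_budget_gb = 0.8 * (hbm_per_gpu_gb - weights_per_gpu_gb)
print(f"KV-cache budget per GPU ≈ {kv_cache_budget_gb:.1f} GB")  # ≈ 73.8 GB
```

This ignores activations, CUDA graph capture, and the MTP draft layer, so the real headroom is smaller than the numbers above suggest.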
After sending 20 parallel POST /v1/chat/completions requests with random questions, I hit an out-of-memory (OOM) error. Is GigaChat really consuming all of my GPU memory, or do I need to change something in the settings?
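For reproduction, the load was generated with a client roughly like this (a sketch of the pattern, not my exact script; the question list and max_tokens are placeholders):

```python
# Fire 20 parallel chat-completion requests at the trtllm-serve endpoint
# configured above. Stdlib only.
import json
import random
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8000/v1/chat/completions"
QUESTIONS = ["What is MoE?", "Explain KV cache.", "Summarize CUDA graphs."]

def build_payload(question: str) -> bytes:
    """JSON body for one chat-completion request."""
    return json.dumps({
        "model": "/gigachat3.1-ultra",
        "messages": [{"role": "user", "content": question}],
        "max_tokens": 512,
    }).encode()

def send_one(_: int) -> int:
    """POST a single random question and return the HTTP status."""
    req = urllib.request.Request(
        URL,
        data=build_payload(random.choice(QUESTIONS)),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=20) as pool:
        print(list(pool.map(send_one, range(20))))
```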