- vLLM version: 0.15.0
- PyTorch version: (from vllm/vllm-openai:latest docker image)
- CUDA version: 12.8
- GPU: 2x NVIDIA H100 NVL (95830 MiB each)
- Driver: 570.133.20
- OS: Linux (Docker)
Command
vllm serve Qwen/Qwen3-Coder-Next \
--max-model-len 16000 \
--tensor-parallel-size 2
CUDA graph capture fails during model initialization with an AssertionError in causal_conv1d_update. The error occurs at line 1160 in causal_conv1d.py:
ERROR 02-03 13:39:49 [multiproc_executor.py:852] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_next.py", line 1320, in gdn_attention_core
ERROR 02-03 13:39:49 [multiproc_executor.py:852] self._forward_core(
ERROR 02-03 13:39:49 [multiproc_executor.py:852] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_next.py", line 585, in _forward_core
ERROR 02-03 13:39:49 [multiproc_executor.py:852] mixed_qkv_non_spec = causal_conv1d_update(
ERROR 02-03 13:39:49 [multiproc_executor.py:852] ^^^^^^^^^^^^^^^^^^^^^
ERROR 02-03 13:39:49 [multiproc_executor.py:852] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/mamba/ops/causal_conv1d.py", line 1160, in causal_conv1d_update
ERROR 02-03 13:39:49 [multiproc_executor.py:852] assert num_cache_lines >= batch
ERROR 02-03 13:39:49 [multiproc_executor.py:852] ^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-03 13:39:49 [multiproc_executor.py:852] AssertionError
The error occurs during compile_or_warm_up_model → capture_model → _capture_cudagraphs → _dummy_run.
Workaround: Adding --enforce-eager, which disables CUDA graph capture, allows the model to run, but with reduced performance (~12 tokens/s vs. an expected 20+ tokens/s with CUDA graphs).
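For reference, the workaround invocation is sketched below. It assumes the same model and arguments as the failing command and only appends --enforce-eager; nothing else is changed.

```shell
# Same arguments as the failing command, plus --enforce-eager,
# which skips CUDA graph capture and runs the model in eager mode.
vllm serve Qwen/Qwen3-Coder-Next \
  --max-model-len 16000 \
  --tensor-parallel-size 2 \
  --enforce-eager
```

With this flag the server starts, at the cost of the lower throughput noted above.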