[Bug]: Qwen3-VL-30B-A3B-Instruct model (E: TP=1, PD: TP=4): instance startup intermittently fails with "gloo connect full mesh failed" #185

@yenuo26

Description

Your current environment

The output of vllm python collect_env.py vllm commit link: c1378b8
The output of vllm-ascend python collect_env.py vllm-ascend commit link: f78db0894660f3e64afb29b204aeb204806ffe08
The output of llm-service commit link: 5c37e8dbc71bfefd0c0fc2e00cca219221000e21

🐛 Describe the bug

Run the following command to reproduce the error (1E1PD):

    e_server_args = [
        "--model", model,
        "--gpu-memory-utilization", "0.0",
        "--tensor-parallel-size", "1",
        "--enforce-eager",
        "--no-enable-prefix-caching",
        "--max-model-len", "20000",
        "--max-num-batched-tokens", "20000",
        "--max-num-seqs", "1",
        "--ec-transfer-config",
        '{"ec_connector_extra_config":{"shared_storage_path":"' + SHARED_STORAGE_PATH + '"},"ec_connector":"ECSharedStorageConnector","ec_role": "ec_producer"}'
    ]

    pd_server_args = [
        "--model", model,
        "--gpu-memory-utilization", "0.9",
        "--tensor-parallel-size", "4",
        "--enforce-eager",
        "--max-model-len", "20000",
        "--max-num-batched-tokens", "20000",
        "--max-num-seqs", "128",
        "--ec-transfer-config",
        '{"ec_connector_extra_config":{"shared_storage_path":"' + SHARED_STORAGE_PATH + '"},"ec_connector":"ECSharedStorageConnector","ec_role": "ec_consumer"}'
    ]

Error output: RuntimeError: Gloo connnectFullMesh failed with [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:144] no error
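
As a side note on the reproduction script, both argument lists build the `--ec-transfer-config` value by string concatenation, which produces invalid JSON if the path ever contains a quote or backslash. A minimal sketch of building the same value with `json.dumps` follows. The helper name `ec_transfer_config` and the placeholder path are hypothetical and not part of the original report; the keys and values are copied from the arguments above. This is unrelated to the Gloo failure itself and is only meant to make the reproduction easier to maintain.

```python
import json

# Hypothetical placeholder: the report does not state the real path.
SHARED_STORAGE_PATH = "/tmp/ec_shared_storage"


def ec_transfer_config(role: str, shared_storage_path: str) -> str:
    """Serialize the --ec-transfer-config value for one role.

    Keys and connector name are taken from the report's argument lists;
    this helper itself is an illustration, not an API of any project.
    """
    return json.dumps(
        {
            "ec_connector_extra_config": {
                "shared_storage_path": shared_storage_path,
            },
            "ec_connector": "ECSharedStorageConnector",
            "ec_role": role,
        }
    )


# One config string per role, ready to follow "--ec-transfer-config".
producer_cfg = ec_transfer_config("ec_producer", SHARED_STORAGE_PATH)
consumer_cfg = ec_transfer_config("ec_consumer", SHARED_STORAGE_PATH)
```

The resulting strings could replace the concatenated literals in `e_server_args` and `pd_server_args`, assuming the servers parse the value as ordinary JSON.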

