Description
Operating system and version
openEuler 24.03 (LTS)
Python environment where the tool is installed
The Python environment inside the Docker container
Python version
3.11
AISBench tool version
Version: 3.0.0
AISBench command
ais_bench --models vllm_api_general_stream --datasets synthetic_gen -m perf --debug
Model configuration file or custom configuration file contents
/usr/local/lib/python3.11/site-packages/ais_bench/benchmark/configs/models/vllm_api/vllm_api_stream_chat.py
from ais_bench.benchmark.utils.model_postprocessors import extract_non_reasoning_content
# NOTE: the import of VLLMCustomAPIChatStream is not shown in this snippet.

models = [
    dict(
        attr="service",
        type=VLLMCustomAPIChatStream,
        abbr='vllm-api-stream-chat',
        path="",
        model="",
        request_rate=0,
        retry=2,
        host_ip="localhost",
        host_port=8080,
        max_out_len=512,
        batch_size=1,
        trust_remote_code=False,
        generation_kwargs=dict(
            temperature=0.5,
            top_k=10,
            top_p=0.95,
            seed=None,
            repetition_penalty=1.03,
        ),
        pred_postprocessor=dict(type=extract_non_reasoning_content),
    )
]
/usr/local/lib/python3.11/site-packages/ais_bench/datasets/synthetic/synthetic_config.py
synthetic_config = {
    "Type": "string",
    "RequestCount": 80,
    "TrustRemoteCode": False,
    "StringConfig": {
        "Input": {
            "Method": "uniform",
            "Params": {"MinValue": 2048, "MaxValue": 2048}
        },
        "Output": {
            "Method": "uniform",
            "Params": {"MinValue": 2048, "MaxValue": 2048}
        }
    },
    "TokenIdConfig": {
        "RequestSize": 2048
    }
}
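For clarity on what this config requests: a "uniform" distribution with `MinValue == MaxValue` should make every one of the 80 requests use a fixed length of 2048 tokens for both input and output. A minimal sketch of that sampling logic (the `sample_uniform` helper is my own illustration, not an AISBench function):

```python
# Hypothetical illustration of the "uniform" Method in StringConfig:
# sampling from an inclusive integer range [MinValue, MaxValue].
import random

def sample_uniform(params: dict) -> int:
    # My assumption of the semantics; when MinValue == MaxValue,
    # every sampled length is the same constant.
    return random.randint(params["MinValue"], params["MaxValue"])

params = {"MinValue": 2048, "MaxValue": 2048}
lengths = [sample_uniform(params) for _ in range(80)]  # RequestCount: 80
assert all(n == 2048 for n in lengths)  # fixed-length workload
```

So the benchmark is a fixed 2048-in / 2048-out workload, with no length variance between requests.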
/usr/local/Ascend/mindie/latest/mindie-service/conf/config.json
{
    "Version": "1.0.0",
    "ServerConfig": {
        "ipAddress": "127.0.0.1",
        "managementIpAddress": "127.0.0.2",
        "port": 1025,
        "managementPort": 1026,
        "metricsPort": 1027,
        "allowAllZeroIpListening": false,
        "maxLinkNum": 1000,
        "httpsEnabled": false,
        "fullTextEnabled": false,
        "tlsCaPath": "security/ca/",
        "tlsCaFile": ["ca.pem"],
        "tlsCert": "security/certs/server.pem",
        "tlsPk": "security/keys/server.key.pem",
        "tlsPkPwd": "security/pass/key_pwd.txt",
        "tlsCrlPath": "security/certs/",
        "tlsCrlFiles": ["server_crl.pem"],
        "managementTlsCaFile": ["management_ca.pem"],
        "managementTlsCert": "security/certs/management/server.pem",
        "managementTlsPk": "security/keys/management/server.key.pem",
        "managementTlsPkPwd": "security/pass/management/key_pwd.txt",
        "managementTlsCrlPath": "security/management/certs/",
        "managementTlsCrlFiles": ["server_crl.pem"],
        "kmcKsfMaster": "tools/pmt/master/ksfa",
        "kmcKsfStandby": "tools/pmt/standby/ksfb",
        "inferMode": "standard",
        "interCommTLSEnabled": true,
        "interCommPort": 1121,
        "interCommTlsCaPath": "security/grpc/ca/",
        "interCommTlsCaFiles": ["ca.pem"],
        "interCommTlsCert": "security/grpc/certs/server.pem",
        "interCommPk": "security/grpc/keys/server.key.pem",
        "interCommPkPwd": "security/grpc/pass/key_pwd.txt",
        "interCommTlsCrlPath": "security/grpc/certs/",
        "interCommTlsCrlFiles": ["server_crl.pem"],
        "openAiSupport": "vllm",
        "tokenTimeout": 3600,
        "e2eTimeout": 65535,
        "distDPServerEnabled": false
    },
    "BackendConfig": {
        "backendName": "mindieservice_llm_engine",
        "modelInstanceNumber": 1,
        "npuDeviceIds": [[0, 1]],
        "tokenizerProcessNumber": 8,
        "multiNodesInferEnabled": false,
        "multiNodesInferPort": 1120,
        "interNodeTLSEnabled": true,
        "interNodeTlsCaPath": "security/grpc/ca/",
        "interNodeTlsCaFiles": ["ca.pem"],
        "interNodeTlsCert": "security/grpc/certs/server.pem",
        "interNodeTlsPk": "security/grpc/keys/server.key.pem",
        "interNodeTlsPkPwd": "security/grpc/pass/mindie_server_key_pwd.txt",
        "interNodeTlsCrlPath": "security/grpc/certs/",
        "interNodeTlsCrlFiles": ["server_crl.pem"],
        "interNodeKmcKsfMaster": "tools/pmt/master/ksfa",
        "interNodeKmcKsfStandby": "tools/pmt/standby/ksfb",
        "kvPoolConfig": {
            "backend": "",
            "configPath": ""
        },
        "ModelDeployConfig": {
            "maxSeqLen": 6144,
            "maxInputTokenLen": 4096,
            "truncation": false,
            "ModelConfig": [
                {
                    "modelInstanceType": "Standard",
                    "modelName": "dony_w8a8_test",
                    "modelWeightPath": "/mnt/DeepSeek-R1-Distill-Llama-70B-w8a8",
                    "worldSize": 2,
                    "cpuMemSize": 5,
                    "npuMemSize": 10,
                    "backendType": "atb",
                    "trustRemoteCode": false,
                    "async_scheduler_wait_time": 120,
                    "kv_trans_timeout": 10,
                    "kv_link_timeout": 1080
                }
            ]
        },
        "ScheduleConfig": {
            "templateType": "Standard",
            "templateName": "Standard_LLM",
            "cacheBlockSize": 128,
            "maxPrefillBatchSize": 50,
            "maxPrefillTokens": 6144,
            "prefillTimeMsPerReq": 150,
            "prefillPolicyType": 0,
            "decodeTimeMsPerReq": 50,
            "decodePolicyType": 0,
            "maxBatchSize": 200,
            "maxIterTimes": 4096,
            "maxPreemptCount": 0,
            "supportSelectBatch": true,
            "maxQueueDelayMicroseconds": 5000,
            "maxFirstTokenWaitTime": 2500
        }
    },
    "LogConfig": {
        "dynamicLogLevel": "",
        "dynamicLogLevelValidHours": 2,
        "dynamicLogLevelValidTime": ""
    }
}
Docker container version
REPOSITORY TAG IMAGE ID CREATED SIZE
swr.cn-south-1.myhuaweicloud.com/ascendhub/mindie 2.2.RC1-800I-A2-py311-openeuler24.03-lts 3006ee810455 4 weeks ago 18.6GB
Expected behavior
With insize 2048, outsize 2048, and batch 20, the run should produce output normally with no errors.
Actual behavior
12/23 17:22:53 - AISBench - INFO - Loading synthetic_gen: /usr/local/lib/python3.11/site-packages/ais_bench/benchmark/configs/./datasets/synthetic/synthetic_gen.py
12/23 17:22:53 - AISBench - INFO - Loading vllm_api_general_stream: /usr/local/lib/python3.11/site-packages/ais_bench/benchmark/configs/./models/vllm_api/vllm_api_general_stream.py
12/23 17:22:53 - AISBench - INFO - Loading example: /usr/local/lib/python3.11/site-packages/ais_bench/benchmark/configs/./summarizers/example.py
12/23 17:22:53 - AISBench - INFO - Current exp folder: outputs/default/20251223_172253
12/23 17:22:53 - AISBench - INFO - Starting performance evaluation tasks...
12/23 17:22:53 - AISBench - INFO - Partitioned into 1 tasks.
12/23 17:23:00 - AISBench - INFO - Task [vllm-api-general-stream/synthetic]
12/23 17:23:03 - AISBench - INFO - Start load data of [vllm-api-general-stream/synthetic]
12/23 17:23:03 - AISBench - WARNING - Parameter 'burstiness' is None. Using default: 0.0
12/23 17:23:03 - AISBench - WARNING - Parameter 'ramp_up_strategy' is None. Using default: None
12/23 17:23:03 - AISBench - WARNING - Parameter 'ramp_up_start_rps' is None. Using default: None
12/23 17:23:03 - AISBench - WARNING - Parameter 'ramp_up_end_rps' is None. Using default: None
12/23 17:23:04 - AISBench - INFO - RPS distribution charts saved to outputs/default/20251223_172253/performances/vllm-api-general-stream/syntheticdataset_rps_distribution_plot.html
12/23 17:23:04 - AISBench - INFO - RPS distribution chart JSON data saved to outputs/default/20251223_172253/performances/vllm-api-general-stream/syntheticdataset_rps_distribution_plot.json
12/23 17:23:04 - AISBench - INFO -
Request Per Second (RPS) Distribution Summary
Metric Value
Total Requests 80
Request Classification Normal: 80 | Timing Anomaly: 0 | Burstiness Anomaly: 0 | Infinite RPS Anomaly: 0
Target Rate 1000.00 RPS
Burstiness 0.000
Normal RPS 1000.00 ± 0.00
Normal RPS Range 1000.00-1000.00
Interval Stats Avg: 0.001s | Min: 0.001s | Max: 0.001s
Interval Classification Normal (Normal + Burstiness Anomaly): 80 | Anomaly (Timing Anomaly + Infinite RPS Anomaly): 0
12/23 17:23:04 - AISBench - INFO - Process 0 using precomputed sleep offsets with 80 requests
12/23 18:24:46 - AISBench - ERROR - /usr/local/lib/python3.11/site-packages/ais_bench/benchmark/clients/base_client.py - raise_error - 35 - [AisBenchClientException] Error processing stream response: [StreamResponseError] Expecting value: line 1 column 1 (char 0)! Raw server response: b'Engine callback timeout: server tokenTimeout'
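One observation, for whatever it is worth: the error is logged roughly one hour after the requests were dispatched (17:23:04 → 18:24:46), which lines up with `"tokenTimeout": 3600` (seconds) in `ServerConfig` above, and the raw server response itself says "server tokenTimeout". A minimal sketch of that arithmetic, using the timestamps from the log (the causal link is my reading of the log, not a confirmed diagnosis):

```python
# Compare the elapsed wall-clock time in the log against
# ServerConfig.tokenTimeout from config.json.
from datetime import datetime

start = datetime.strptime("17:23:04", "%H:%M:%S")  # requests dispatched
error = datetime.strptime("18:24:46", "%H:%M:%S")  # tokenTimeout error logged
elapsed = (error - start).total_seconds()

token_timeout = 3600  # "tokenTimeout" in ServerConfig, in seconds
print(elapsed)                  # 3702.0
print(elapsed > token_timeout)  # True: just past the 3600 s limit
```

If that reading is right, raising `tokenTimeout` in the MindIE `config.json` (or reducing RequestCount / output length so requests complete within the window) might be a workaround, but I have not verified this.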