👋 Join our WeChat community
GLM-ASR-Nano-2512 is a robust, open-source speech recognition model with 1.5B parameters. Designed for real-world complexity, it outperforms OpenAI Whisper V3 on multiple benchmarks while maintaining a compact size.
Key capabilities include:
- **Exceptional Dialect Support**: Beyond standard Mandarin and English, the model is highly optimized for Cantonese (粤语) and other dialects, effectively bridging the gap in dialectal speech recognition.
- **Low-Volume Speech Robustness**: Specifically trained for whisper/quiet-speech scenarios, it captures and accurately transcribes extremely low-volume audio that traditional models often miss.
- **SOTA Performance**: Achieves the lowest average error rate (4.10) among comparable open-source models, with significant advantages on Chinese benchmarks (Wenet Meeting, Aishell-1, etc.).
We evaluated GLM-ASR-Nano against leading open-source and closed-source models. The results demonstrate that GLM-ASR-Nano (1.5B) achieves superior performance, particularly in challenging acoustic environments.
Notes:
- Wenet Meeting reflects real-world meeting scenarios with noise and overlapping speech.
- Aishell-1 is a standard Mandarin benchmark.
GLM-ASR-Nano supports 17 languages with high usability (WER ≤ 20%).
| Model | Download Links |
|---|---|
| GLM-ASR-Nano-2512 | 🤗 Hugging Face 🤖 ModelScope |
- Please note that the model weight format has changed after adapting to `transformers` and SGLang. If you downloaded the model before December 27, 2025, please pull the latest version.
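If you have an older cached copy, one way to refresh it is with `huggingface_hub` (a minimal sketch, assuming the default local cache; `force_download=True` simply re-fetches all files):

```python
# Re-download the updated weights, bypassing any stale local cache.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="zai-org/GLM-ASR-Nano-2512", force_download=True)
```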
We provide two test audio clips, one in Chinese and one in English.
```bash
pip install -r requirements.txt
sudo apt install ffmpeg
```

- transformers 5.0.0 requires installation from source; refer to requirements.txt.
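Before running the examples below, you can optionally sanity-check the environment (a hypothetical snippet, not part of the official setup):

```python
# Hypothetical environment check; not part of the official setup.
import shutil

import torch
import transformers

assert shutil.which("ffmpeg") is not None, "ffmpeg not found on PATH"
print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
```

With the environment ready, the `transformers` example below transcribes the Chinese clip.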
```python
from transformers import AutoModel, AutoProcessor
import torch

# Run on GPU when available; the model is loaded in bfloat16.
device = "cuda" if torch.cuda.is_available() else "cpu"
repo_id = "zai-org/GLM-ASR-Nano-2512"

processor = AutoProcessor.from_pretrained(repo_id)
model = AutoModel.from_pretrained(repo_id, dtype=torch.bfloat16, device_map=device)

# A chat-style request: one audio clip plus a transcription instruction.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "audio",
                "url": "example_zh.wav",
            },
            {"type": "text", "text": "Please transcribe this audio into text"},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_dict=True, return_tensors="pt"
)
inputs = inputs.to(device, dtype=torch.bfloat16)

# Greedy decoding; drop the prompt tokens so only the transcription is printed.
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
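For convenience, the steps above can be wrapped in a small helper (a sketch that continues from the snippet above; `transcribe` is a hypothetical name, not a repo API):

```python
# Continues from the snippet above: model, processor, and device are already defined.
def transcribe(audio_path: str, prompt: str = "Please transcribe this audio into text") -> str:
    """Transcribe a single audio file and return the decoded text."""
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "audio", "url": audio_path},
                {"type": "text", "text": prompt},
            ],
        }
    ]
    inputs = processor.apply_chat_template(
        messages, tokenize=True, add_generation_prompt=True, return_dict=True, return_tensors="pt"
    ).to(device, dtype=torch.bfloat16)
    outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    return processor.batch_decode(
        outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
    )[0]

print(transcribe("example_en.wav"))  # English clip
print(transcribe("example_zh.wav"))  # Chinese clip
```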
- SGLang

No release version is currently available; please use the latest development Docker image.
```bash
docker pull lmsysorg/sglang:dev
```

Enter the Docker container and run:
```bash
pip install git+https://github.com/huggingface/transformers
python3 -m sglang.launch_server --model-path zai-org/GLM-ASR-Nano-2512 --served-model-name glm-asr --host 0.0.0.0 --port 8000
```

Then send requests to the server using the following example code:
```python
from openai import OpenAI

# The SGLang server exposes an OpenAI-compatible API; no real key is required.
openai_api_key = "EMPTY"
openai_api_base = "http://127.0.0.1:8000/v1"
client = OpenAI(api_key=openai_api_key, base_url=openai_api_base)

response = client.chat.completions.create(
    model="glm-asr",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "audio_url",
                    "audio_url": {"url": "example_zh.wav"},
                },
                {
                    "type": "text",
                    "text": "Please transcribe this audio into text",
                },
            ],
        }
    ],
    max_tokens=1024,
)
print(response.choices[0].message.content.strip())
```
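If the server cannot resolve a local path such as `example_zh.wav`, a common workaround (an assumption about your deployment, not something this repo documents) is to embed the clip as a base64 data URL:

```python
# Hypothetical variant: send the local file inline as a base64 data URL.
# Reuses the `client` created in the example above.
import base64

with open("example_zh.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="glm-asr",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "audio_url",
                    "audio_url": {"url": f"data:audio/wav;base64,{audio_b64}"},
                },
                {"type": "text", "text": "Please transcribe this audio into text"},
            ],
        }
    ],
    max_tokens=1024,
)
print(response.choices[0].message.content.strip())
```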
- transformers 4.51.3 (for models that have not been updated)

```bash
python inference.py --checkpoint_dir zai-org/GLM-ASR-Nano-2512 --audio examples/example_en.wav  # English
python inference.py --checkpoint_dir zai-org/GLM-ASR-Nano-2512 --audio examples/example_zh.wav  # Chinese
```

For the two example audio clips above, the model produces accurate transcriptions:
```
be careful not to allow fabric to become too hot which can cause shrinkage or in extreme cases scorch
我还能再搞一个,就算是非常小的声音也能识别准确
```

(The Chinese transcript translates to: "I can do one more; even very quiet speech is recognized accurately.")
