👋 Join our WeChat community
GLM-ASR-Nano-2512 is a robust, open-source speech recognition model with 1.5B parameters. Designed for real-world complexity, it outperforms OpenAI Whisper V3 on multiple benchmarks while maintaining a compact size.
Key capabilities include:
- **Exceptional Dialect Support**: Beyond standard Mandarin and English, the model is highly optimized for Cantonese (粤语) and other dialects, effectively bridging the gap in dialectal speech recognition.
- **Low-Volume Speech Robustness**: Specifically trained for whisper/quiet-speech scenarios, it captures and accurately transcribes extremely low-volume audio that traditional models often miss.
- **SOTA Performance**: Achieves the lowest average error rate (4.10) among comparable open-source models, with significant advantages on Chinese benchmarks (Wenet Meeting, Aishell-1, etc.).
We evaluated GLM-ASR-Nano against leading open-source and closed-source models. The results demonstrate that GLM-ASR-Nano (1.5B) achieves superior performance, particularly in challenging acoustic environments.
Notes:
- Wenet Meeting reflects real-world meeting scenarios with noise and overlapping speech.
- Aishell-1 is a standard Mandarin benchmark.
GLM-ASR-Nano supports 17 languages with high usability (WER ≤ 20%).
| Model | Download Links |
|---|---|
| GLM-ASR-Nano-2512 | 🤗 Hugging Face 🤖 ModelScope |
- Please note that the model weight format has changed after adapting to `transformers` and SGLang. If you downloaded the model before December 27, 2025, please pull the latest version.
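If you have an older cached copy, one way to refresh it is with `huggingface_hub` (a minimal sketch, assuming the default local cache; `force_download=True` simply re-fetches all files):

```python
# Re-download the updated weights, bypassing any stale local cache.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="zai-org/GLM-ASR-Nano-2512", force_download=True)
```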
We provide two test audio clips, one in Chinese and one in English.
```bash
pip install -r requirements.txt
sudo apt install ffmpeg
```

- transformers 5.0.0 requires installation from source; refer to requirements.txt.
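Before running the examples below, you can optionally sanity-check the environment (a hypothetical snippet, not part of the official setup):

```python
# Hypothetical environment check; not part of the official setup.
import shutil

import torch
import transformers

assert shutil.which("ffmpeg") is not None, "ffmpeg not found on PATH"
print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
```

With the environment ready, the `transformers` example below transcribes the Chinese clip.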
```python
from transformers import AutoModel, AutoProcessor
import torch

# Run on GPU when available; the model is loaded in bfloat16.
device = "cuda" if torch.cuda.is_available() else "cpu"
repo_id = "zai-org/GLM-ASR-Nano-2512"

processor = AutoProcessor.from_pretrained(repo_id)
model = AutoModel.from_pretrained(repo_id, dtype=torch.bfloat16, device_map=device)

# A chat-style request: one audio clip plus a transcription instruction.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "audio",
                "url": "example_zh.wav",
            },
            {"type": "text", "text": "Please transcribe this audio into text"},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_dict=True, return_tensors="pt"
)
inputs = inputs.to(device, dtype=torch.bfloat16)

# Greedy decoding; drop the prompt tokens so only the transcription is printed.
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
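For convenience, the steps above can be wrapped in a small helper (a sketch that continues from the snippet above; `transcribe` is a hypothetical name, not a repo API):

```python
# Continues from the snippet above: model, processor, and device are already defined.
def transcribe(audio_path: str, prompt: str = "Please transcribe this audio into text") -> str:
    """Transcribe a single audio file and return the decoded text."""
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "audio", "url": audio_path},
                {"type": "text", "text": prompt},
            ],
        }
    ]
    inputs = processor.apply_chat_template(
        messages, tokenize=True, add_generation_prompt=True, return_dict=True, return_tensors="pt"
    ).to(device, dtype=torch.bfloat16)
    outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    return processor.batch_decode(
        outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
    )[0]

print(transcribe("example_en.wav"))  # English clip
print(transcribe("example_zh.wav"))  # Chinese clip
```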
- SGLang

No release version is currently available; please use the latest development Docker image.
```bash
docker pull lmsysorg/sglang:dev
```

Enter the Docker container and run:
```bash
pip install git+https://github.com/huggingface/transformers
python3 -m sglang.launch_server --model-path zai-org/GLM-ASR-Nano-2512 --served-model-name glm-asr --host 0.0.0.0 --port 8000
```

Then send requests to the server using the following example code:
```python
from openai import OpenAI

# The SGLang server exposes an OpenAI-compatible API; no real key is required.
openai_api_key = "EMPTY"
openai_api_base = "http://127.0.0.1:8000/v1"
client = OpenAI(api_key=openai_api_key, base_url=openai_api_base)

response = client.chat.completions.create(
    model="glm-asr",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "audio_url",
                    "audio_url": {"url": "example_zh.wav"},
                },
                {
                    "type": "text",
                    "text": "Please transcribe this audio into text",
                },
            ],
        }
    ],
    max_tokens=1024,
)
print(response.choices[0].message.content.strip())
```
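If the server cannot resolve a local path such as `example_zh.wav`, a common workaround (an assumption about your deployment, not something this repo documents) is to embed the clip as a base64 data URL:

```python
# Hypothetical variant: send the local file inline as a base64 data URL.
# Reuses the `client` created in the example above.
import base64

with open("example_zh.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="glm-asr",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "audio_url",
                    "audio_url": {"url": f"data:audio/wav;base64,{audio_b64}"},
                },
                {"type": "text", "text": "Please transcribe this audio into text"},
            ],
        }
    ],
    max_tokens=1024,
)
print(response.choices[0].message.content.strip())
```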
- transformers 4.51.3 (for models that have not been updated)

```bash
python inference.py --checkpoint_dir zai-org/GLM-ASR-Nano-2512 --audio examples/example_en.wav  # English
python inference.py --checkpoint_dir zai-org/GLM-ASR-Nano-2512 --audio examples/example_zh.wav  # Chinese
```

For the two example audio clips above, the model produces accurate transcriptions:
```
be careful not to allow fabric to become too hot which can cause shrinkage or in extreme cases scorch
我还能再搞一个,就算是非常小的声音也能识别准确
```

(The Chinese transcript translates to: "I can do one more; even very quiet speech is recognized accurately.")
