Conversation

@zheliuyu (Contributor) commented Nov 19, 2025

What does this PR do?

Still going strong and still at it. 😆

This is a proof-of-concept experiment for #39105 (comment)

Prepare the environment

cann = 8.3.RC1
torch = 2.7.1
torch_npu = 2.7.1
device = Atlas 900 A2 * 8

pip install -e kernels

git clone https://github.com/zheliuyu/transformers-dev
pip install -e transformers-dev

pip install llamafactory
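
As a quick sanity check after installation (a minimal sketch; assumes the packages above installed cleanly and uses only standard torch_npu calls):

# Sanity check for the NPU setup (sketch; package names as installed above).
import importlib.metadata

import torch
import torch_npu  # noqa: F401  # importing torch_npu registers the NPU backend on torch

print("torch:", torch.__version__)
print("torch_npu:", importlib.metadata.version("torch_npu"))
print("kernels:", importlib.metadata.version("kernels"))
print("NPU available:", torch.npu.is_available(), "| devices:", torch.npu.device_count())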

Using LLaMA-Factory, we fine-tuned Qwen3-8B.

llamafactory-cli train custom.yaml

custom.yaml

### model
model_name_or_path: Qwen/Qwen3-8B
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 8
lora_target: all

### dataset
dataset: identity,alpaca_en_demo
template: llama3
cutoff_len: 2048
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16
dataloader_num_workers: 4

### output
output_dir: saves/Qwen/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true
save_only_model: false
report_to: none  # choices: [none, wandb, tensorboard, swanlab, mlflow]

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
resume_from_checkpoint: null

result

No kernels

Set use_kernels=False

***** train metrics *****
"epoch": 3.0,
"total_flos": 2.8124,
"train_loss": 1.159449,
"train_runtime": 303.7698,
"train_samples_per_second": 10.7650,
"train_steps_per_second": 0.1780

With this PR

Set use_kernels=True

***** train metrics *****
"epoch": 3.0,
"total_flos": 2.8124,
"train_loss": 1.159411,
"train_runtime": 272.5237,
"train_samples_per_second": 11.9990,
"train_steps_per_second": 0.1980

(303.7698 - 272.5237) / 303.7698 ≈ 10.3%

The results show an approximately 10% speedup compared to the run without this PR.
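
For reference, the same switch can also be exercised directly in transformers when loading the model, without going through LLaMA-Factory. A minimal sketch (the device string is an assumption for the Ascend setup above):

# Sketch: toggling the kernels integration via from_pretrained;
# use_kernels is the only difference between the two runs compared above.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B",
    use_kernels=True,   # False reproduces the baseline ("No kernels") run
    device_map="npu",   # assumption: NPU placement; adjust for your setup
)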

@zheliuyu marked this pull request as ready for review November 19, 2025 09:22
@SunMarc (Member) left a comment

Thanks, just a comment

Comment on lines -88 to +90

- Mode.INFERENCE: LayerRepository(
-     repo_id="kernels-community/liger_kernels",
-     layer_name="LigerRMSNorm",
+ Mode.TRAINING: LayerRepository(
+     repo_id="kernels-ext-npu/rmsnorm",
+     layer_name="rmsnorm",
Member:

For inference, should we still keep liger_kernels?

Contributor:

Also @zheliuyu, I have a few concerns about including kernels from other communities that may not yet be fully mature in the default mapping of Transformers, since this is code being run on users' devices, and we need to keep control of what's being executed. I would kindly suggest using KernelConfig directly and specifying the desired mapping there instead of using the default one for now. For example:

from transformers import AutoModelForCausalLM, KernelConfig

kernel_config = KernelConfig(kernel_mapping={"RMSNorm": "kernels-ext-npu/rmsnorm:rmsnorm"})
model = AutoModelForCausalLM.from_pretrained(
    "unsloth/Llama-3.2-1B-Instruct", use_kernels=True, device_map=torch_device, kernel_config=kernel_config
)

Once the npu community is mature enough we can consider adding kernels to the default mapping directly.
