Ascend Skill Contest entry: Training Task 3 by Michael-Salon · Pull Request #10 · Eco-Sphere/ascend-skill-contest

Michael-Salon · 2026-03-19T09:30:14Z

Task 3: PyTorch GPU→NPU migration

Skill name

npu-transfer

Training framework

PyTorch FSDP2

Prompt

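I need to migrate the FSDP2 nanoGPT training task in the PyTorch Examples repository from a GPU environment to an NPU environment. Please complete the adaptation: integrate the fused attention operator, launch 2-card distributed training with torchrun, and verify that checkpoint saving and loading work.

Interaction log

User: Migrate /home/xxxxx/examples-main/distributed/FSDP2 to NPU and write the output to /home/xxxxx/test-output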

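Agent:

Analyzes the source code and identifies the training framework as FSDP2
Identifies the device-related code: torch.accelerator.is_available(), torch.accelerator.current_accelerator()
Identifies the attention implementation: F.scaled_dot_product_attention
Generates the adaptation code:
- model_npu.py: adds an NPUAttention class that uses npu_fusion_attention (BNSD layout)
- example_npu.py: device detection, initialization, and HCCL backend adaptation
- run_npu.sh: NPU launch script
- requirements_npu.txt: environment dependencies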
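
The attention swap listed for model_npu.py mostly comes down to passing the right arguments. The sketch below is not the PR's actual NPUAttention code; it only illustrates, under assumptions, which keyword arguments a BNSD call to npu_fusion_attention would need. The helper name fusion_attention_kwargs is hypothetical, and the parameter names (head_num, input_layout, scale, keep_prob) reflect my understanding of the torch_npu API. One point worth noting is that F.scaled_dot_product_attention scales by 1/sqrt(head_dim) by default, so that scale presumably has to be passed explicitly to the fused operator.

```python
import math


def fusion_attention_kwargs(n_heads: int, head_dim: int, dropout_p: float = 0.0) -> dict:
    """Build keyword arguments for a BNSD fused-attention call.

    Hypothetical helper for illustration; parameter names are assumed to
    match torch_npu.npu_fusion_attention.
    """
    return {
        "head_num": n_heads,                 # number of attention heads (N in BNSD)
        "input_layout": "BNSD",              # (batch, heads, seq_len, head_dim)
        "scale": 1.0 / math.sqrt(head_dim),  # match the SDPA default scaling
        "keep_prob": 1.0 - dropout_p,        # keep probability, not drop probability
    }


# On an Ascend host the call would look roughly like this (assumed usage):
#   out = torch_npu.npu_fusion_attention(q, k, v, **fusion_attention_kwargs(8, 64))[0]
kwargs = fusion_attention_kwargs(n_heads=8, head_dim=64)
```

The tensors q, k and v in the commented call are assumed to already be in (batch, heads, seq_len, head_dim) order, which is the same layout F.scaled_dot_product_attention expects.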
User: On the host, run torchrun --nproc_per_node 2 example_npu.py --mixed-precision
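
torchrun exports RANK, LOCAL_RANK and WORLD_SIZE to each worker process, and the training script uses them to pick a device and a collective backend. As a minimal sketch of that step (not the PR's example_npu.py; the function name rank_setup and the returned dict layout are hypothetical), the mapping could be written as:

```python
import os

# Collective backend usually paired with each accelerator type;
# anything else falls back to gloo.
_BACKENDS = {"npu": "hccl", "cuda": "nccl"}


def rank_setup(device_type: str, env=None) -> dict:
    """Derive per-process settings from the variables torchrun sets."""
    env = os.environ if env is None else env
    local_rank = int(env.get("LOCAL_RANK", 0))
    return {
        "rank": int(env.get("RANK", 0)),
        "world_size": int(env.get("WORLD_SIZE", 1)),
        "device": f"{device_type}:{local_rank}",  # e.g. "npu:1" for the second card
        "backend": _BACKENDS.get(device_type, "gloo"),
    }


# Simulate the second worker of a 2-card launch.
cfg = rank_setup("npu", {"RANK": "1", "LOCAL_RANK": "1", "WORLD_SIZE": "2"})
```

In a real script the backend string would be handed to torch.distributed.init_process_group and the device string to the device-selection call; both are omitted here so the sketch runs without an accelerator.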

Agent: Training completed successfully. Output:

Step 0: loss = -1892.3685
...
Step 9: loss = -1836591.2500
Training completed, checkpoint saved.

User: Run it again to verify checkpoint loading

Agent: The second run resumes from the checkpoint:

Step 0: loss = -2204390.5000
...
Step 9: loss = -6416212.0000
Training completed, checkpoint saved.
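
The two-run check above follows a common pattern: load state if a checkpoint exists, otherwise start fresh, then save at the end. The sketch below shows only that control flow, with a step counter in a JSON file standing in for the model and optimizer state that a real script would save. The file name state.json and the function run_training are hypothetical, not taken from the PR.

```python
import json
import tempfile
from pathlib import Path


def run_training(ckpt_dir: Path, num_steps: int = 10) -> tuple[int, int]:
    """Resume from ckpt_dir if a checkpoint exists, then save a new one.

    Returns (start_step, end_step). The step counter is a stand-in for
    real model and optimizer state.
    """
    ckpt_file = ckpt_dir / "state.json"
    start = json.loads(ckpt_file.read_text())["step"] if ckpt_file.exists() else 0
    end = start + num_steps  # placeholder for the actual training loop
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    ckpt_file.write_text(json.dumps({"step": end}))
    return start, end


with tempfile.TemporaryDirectory() as tmp:
    first = run_training(Path(tmp) / "checkpoints")   # no checkpoint yet: starts at 0
    second = run_training(Path(tmp) / "checkpoints")  # picks up where the first run ended
```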

- Support CPU, memory, NPU, HCCL data collection
- Support level0/level1/level2 profiling levels
- Support step range collection (start_step, end_step)
- Include VeRL profiler configuration guide
- Include common issues troubleshooting

Michael-Salon · 2026-03-19T09:40:03Z

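This PR also carries the commit for Task 2. In addition, the current version first checks whether the existing environment is suitable for migrating directly; if it is not, it downloads the dependencies. That step depends on the server's network conditions and may fail when the server is offline or the network is very poor.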

scysw2 and others added 2 commits March 17, 2026 14:21

Add npu-transfer skill for GPU to NPU migration

8b7ecdc

Michael-Salon wants to merge 2 commits into Eco-Sphere:main from Michael-Salon:feature/npu-transfer
