Ascend Skill Contest submission: Training Task 3 #10
Open
Michael-Salon wants to merge 2 commits into
- Support CPU, memory, NPU, HCCL data collection
- Support level0/level1/level2 profiling levels
- Support step range collection (start_step, end_step)
- Include VeRL profiler configuration guide
- Include common issues troubleshooting
Author
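This PR also carries the commit for Task 2. In addition, the current version first checks whether the current environment already has a suitable setup that can be used for the migration directly. If not, it downloads the dependencies. That step depends on the server's network conditions and may fail when the server is offline or the connection is very poor.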
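The check-then-download behavior described in the comment above could look roughly like the following. This is a minimal sketch, not the skill's actual code: the package list `REQUIRED` and the helper names `missing_packages` and `plan` are hypothetical, and only importability is checked, not versions.

```python
import importlib.util

# Hypothetical list of packages the migration needs; the real skill may check more.
REQUIRED = ("torch", "torch_npu")


def missing_packages(names=REQUIRED):
    """Return the top-level packages that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def plan(names=REQUIRED):
    """Decide whether the existing environment can be reused as-is."""
    # "download" is the branch that needs network access and may fail offline.
    return "reuse" if not missing_packages(names) else "download"


if __name__ == "__main__":
    print(plan(), missing_packages())
```

Checking before downloading keeps the offline case working whenever the dependencies are already installed, which matches the failure mode the comment warns about.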
Task 3: PyTorch GPU→NPU Migration
Skill name
npu-transfer
Training framework
PyTorch FSDP2
Prompt
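I need to migrate the FSDP2 nanoGPT training task in the PyTorch Examples repository from a GPU environment to an NPU environment. Please complete the adaptation for me: integrate the fused attention operator, launch 2-device distributed training with torchrun, and verify that checkpoint saving and loading work.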
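One piece of such a migration is choosing the device and collective backend at runtime instead of hard-coding `cuda` and `nccl`. The sketch below is an assumption about how that could be done, not the code the skill generates. The helper names `select_device_type`, `BACKEND_FOR`, and `local_device` are hypothetical. It relies on `torch_npu` registering the `torch.npu` namespace on import and on torchrun setting `LOCAL_RANK` for each process. The fused attention operator is not covered here.

```python
import importlib.util
import os


def select_device_type() -> str:
    """Return "npu", "cuda", or "cpu" depending on what this host offers."""
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # PyTorch is not installed; nothing to accelerate
    import torch

    if importlib.util.find_spec("torch_npu") is not None:
        try:
            import torch_npu  # noqa: F401  (importing registers torch.npu)

            if torch.npu.is_available():
                return "npu"
        except Exception:
            pass  # plugin is installed but unusable, e.g. the driver is missing
    return "cuda" if torch.cuda.is_available() else "cpu"


# Collective backend usually paired with each device type.
BACKEND_FOR = {"npu": "hccl", "cuda": "nccl", "cpu": "gloo"}


def local_device(device_type: str) -> str:
    """Build the per-process device string from torchrun's LOCAL_RANK."""
    if device_type == "cpu":
        return "cpu"
    return f"{device_type}:{int(os.environ.get('LOCAL_RANK', '0'))}"


if __name__ == "__main__":
    kind = select_device_type()
    print(kind, BACKEND_FOR[kind], local_device(kind))
```

Under `torchrun --nproc_per_node 2`, the two processes would see `LOCAL_RANK` values 0 and 1, so on an NPU host `local_device` yields `npu:0` and `npu:1`.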
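Interaction
User: Help me migrate /home/xxxxx/examples-main/distributed/FSDP2 to NPU and write the output to /home/xxxxx/test-output
Agent:
User: Run torchrun --nproc_per_node 2 example_npu.py --mixed-precision on the host machine
Agent: Training completed successfully. Output:
User: Run it again to verify checkpoint loading
Agent: The second run resumed from the checkpoint: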
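The two-run check in this exchange (run once to write a checkpoint, run again and confirm that training resumes instead of restarting) can be illustrated with a toy stand-in. This is neither the agent's output nor the FSDP2 example's real checkpointing code: `run_training`, the `state.json` file name, and the step counter are hypothetical, and a plain JSON file replaces the sharded model and optimizer state.

```python
import json
import tempfile
from pathlib import Path


def run_training(ckpt_dir: Path, steps: int = 3) -> dict:
    """Run `steps` fake training steps, resuming from ckpt_dir/state.json if present."""
    ckpt = ckpt_dir / "state.json"
    if ckpt.exists():
        state = json.loads(ckpt.read_text())  # second run: load saved progress
        resumed = True
    else:
        state = {"step": 0}  # first run: start from scratch
        resumed = False
    start = state["step"]
    for _ in range(steps):
        state["step"] += 1  # stand-in for one optimizer step
    ckpt.write_text(json.dumps(state))  # save so the next run can resume
    return {"resumed": resumed, "start": start, "end": state["step"]}


with tempfile.TemporaryDirectory() as d:
    first = run_training(Path(d))   # writes the checkpoint
    second = run_training(Path(d))  # must pick up where the first run stopped
```

The property being verified is that the second run reports `resumed` as true and starts at the step where the first run ended.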