Support helm deploy and fix flash ckpt engine bug #1716

Open: yifeng-x wants to merge 7 commits into intelligent-machine-learning:master from yifeng-x:helm_deploy_and_ckpt_bug

@yifeng-x

What changes were proposed in this pull request?

1. Support deploying DLRover with Helm.
2. Fix a bug where DLRover fails to generate the dlrover_latest.txt file correctly during flash checkpointing in the NPU environment.
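
For readers unfamiliar with tracker files, here is a minimal sketch of one way to record the latest checkpoint step so that a crash never leaves a half-written file. It is not the code changed in this PR (the diff is not shown here). The helper name `write_tracker` and the assumption that the file holds only the step number are illustrative; only the file name `dlrover_latest.txt` comes from the description above.

```python
import os
import tempfile

# File name taken from the PR description; everything else is hypothetical.
TRACKER_FILE = "dlrover_latest.txt"


def write_tracker(ckpt_dir: str, step: int) -> str:
    """Atomically record `step` as the latest checkpoint step.

    Assumption: the tracker file contains only the step number as text.
    """
    os.makedirs(ckpt_dir, exist_ok=True)
    # Write to a temporary file in the same directory so the final
    # rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=ckpt_dir, prefix=".tracker_")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(str(step))
            f.flush()
            os.fsync(f.fileno())
        final_path = os.path.join(ckpt_dir, TRACKER_FILE)
        # os.replace overwrites the destination in a single step.
        os.replace(tmp_path, final_path)
    except BaseException:
        # Clean up the temporary file if anything went wrong.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return final_path
```

Writing to a temporary file and renaming it means a reader sees either the old step or the new one, never a partial write.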

Why are the changes needed?

1. Simplify the DLRover deployment process.
2. Allow training to resume normally from checkpoints in the NPU environment.
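
To illustrate why a missing or malformed tracker file blocks resumption, here is a hedged sketch of how a resume path might look up the latest step. The function `read_latest_step` is hypothetical and is not DLRover's API; it assumes the tracker file holds only the step number as text.

```python
import os
from typing import Optional


def read_latest_step(
    ckpt_dir: str, tracker_name: str = "dlrover_latest.txt"
) -> Optional[int]:
    """Return the step stored in the tracker file, or None if unusable.

    Hypothetical helper: assumes the file contains a single integer.
    """
    path = os.path.join(ckpt_dir, tracker_name)
    try:
        with open(path) as f:
            text = f.read().strip()
    except FileNotFoundError:
        # No tracker file: there is nothing to resume from.
        return None
    try:
        return int(text)
    except ValueError:
        # Empty or corrupted content is treated like a missing file.
        return None
```

A caller would start from scratch when this returns `None` and otherwise load the checkpoint saved at the returned step.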

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit test.

@BalaBalaYi BalaBalaYi added the enhancement New feature or request label Apr 13, 2026
@BalaBalaYi BalaBalaYi added this to the v0.7.0 milestone Apr 13, 2026
Comment thread dlrover/trainer/torch/flash_checkpoint/engine.py Outdated
Comment thread dlrover/trainer/torch/flash_checkpoint/engine.py
Comment thread helm_install/templates/deployment.yaml Outdated
Comment thread helm_install/crds/crd.yaml
Comment thread helm_install/templates/configmap.yaml Outdated
Comment thread helm_install/helm_developer_guide.md
@codecov

codecov Bot commented Apr 15, 2026

Codecov Report

❌ Patch coverage is 61.11111% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.76%. Comparing base (63c9b95) to head (b04d849).
⚠️ Report is 1 commit behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| dlrover/trainer/torch/flash_checkpoint/engine.py | 50.00% | 7 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1716      +/-   ##
==========================================
- Coverage   81.76%   81.76%   -0.01%     
==========================================
  Files         251      251              
  Lines       24360    24375      +15     
==========================================
+ Hits        19919    19931      +12     
- Misses       4441     4444       +3     


@yifeng-x yifeng-x requested a review from BalaBalaYi April 27, 2026 01:37