-
Notifications
You must be signed in to change notification settings - Fork 3k
[WIP][BREAKING][worker, ckpt] support checkpoint engine for sync parameters in hybrid mode #4602
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a checkpoint_engine to handle weight synchronization in hybrid mode, with changes across FSDP and Megatron workers, as well as vLLM and SGLang rollouts. The implementation adds new configuration flags and logic to conditionally use this new engine.
My review has identified a few critical issues:
- A potential bug in
fsdp_workers.pywhere incorrect weights might be updated due to a variable mix-up. This is coupled with significant code duplication that should be refactored. - Unsafe access to environment variables in
vllm_rollout.pywhich could lead to worker crashes if the environment is not perfectly configured.
I've provided code suggestions to address these critical issues for improved correctness and robustness.
| rank = int(os.environ["RANK"]) | ||
| local_world_size = int(os.environ["RAY_LOCAL_WORLD_SIZE"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Accessing environment variables using os.environ["KEY"] is unsafe as it will raise a KeyError if the variable is not set, causing the worker to crash. It's much safer to use os.getenv("KEY", default_value).
This is a critical issue that can lead to runtime crashes if the environment is not perfectly configured.
| rank = int(os.environ["RANK"]) | |
| local_world_size = int(os.environ["RAY_LOCAL_WORLD_SIZE"]) | |
| rank = int(os.getenv("RANK", "0")) | |
| local_world_size = int(os.getenv("RAY_LOCAL_WORLD_SIZE", "1")) |


What does this PR do?
Implement the
checkpoint engineas a standalone update weights module for hybrid mode.Currently, the inference backend does not support vllm. Adaptation for the vLLM module requires separating actor and rollout into separate processes:#4280
Checklist Before Starting
1. [BREAKING][recipe, ckpt] feat: support parameter sync by checkpoint-engine. only for fully_async mode. #4427
2. [wip][BREAKING][recipe, ckpt]add checkpoint engine for one step off policy #4601
[{modules}] {type}: {description}(This will be checked by the CI){modules}includefsdp,megatron,sglang,vllm,rollout,trainer,ci,training_utils,recipe,hardware,deployment,ray,worker,single_controller,misc,perf,model,algo,env,tool,ckpt,doc,data,cfg,reward,like[megatron, fsdp, doc]{type}is infeat,fix,refactor,chore,test[BREAKING]to the beginning of the title.[BREAKING][fsdp, megatron] feat: dynamic batchingTest
API and Usage Example
# Add code snippet or script demonstrating how to use thisDesign & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=alwaysci-requestchannel in theverlSlack workspace. (If not accessible, please try the Feishu group (飞书群).)