[1/2] feat/add_fault_torlance #1404
base: main
Conversation
Pull request overview
This pull request adds release and resume functionality for memory regions (MR) to the sglang codebase, enabling better memory management during model weight updates and memory saver operations. This is part 1 of a 2-part change series.
Changes:
- Adds memory region registration and unregistration methods to the ModelRunner for remote instance transfer engine
- Introduces ReleaseMemoryOccupationReqInput and ResumeMemoryOccupationReqInput request types with corresponding handler methods
- Integrates memory region lifecycle management with memory saver pause/resume operations and removes the check that blocked using TransferEngine together with memory saver
Comments suppressed due to low confidence (2)
docker/patch/latest/sglang.patch:528
- The ReleaseMemoryOccupationReqInput and ResumeMemoryOccupationReqInput dataclasses define a 'tag' field but don't provide any documentation about what valid values are expected or what different tags mean. Consider adding docstrings to these classes explaining the purpose and valid values for the tag field.
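A minimal sketch of what such docstrings could look like, assuming the tag field accepts the GPU_MEMORY_TYPE_* constants that appear elsewhere in this patch (e.g. GPU_MEMORY_TYPE_WEIGHTS, GPU_MEMORY_TYPE_CUDA_GRAPH); the exact set of valid values is an assumption, and the BaseReq stub stands in for sglang's request base class.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BaseReq:
    """Stand-in for sglang's request base class (for illustration only)."""


@dataclass
class ReleaseMemoryOccupationReqInput(BaseReq):
    """Ask the scheduler to release GPU memory held by one resource type.

    tag: which memory pool to release, e.g. GPU_MEMORY_TYPE_WEIGHTS or
    GPU_MEMORY_TYPE_CUDA_GRAPH (assumed values); None could mean "all pools".
    """

    tag: Optional[str] = None


@dataclass
class ResumeMemoryOccupationReqInput(BaseReq):
    """Ask the scheduler to re-occupy memory released earlier with the same tag."""

    tag: Optional[str] = None
```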
@@ -1286,6 +1286,19 @@ class UpdateWeightsFromIPCReqOutput(BaseReq):
success: bool
message: str
+@dataclass
+class PostProcessWeightsReqInput(BaseReq):
+ # Whether to restore weights before loading new weights
+ restore_weights_before_load: bool = False
+ # Whether to enable quantization post-processing
+ post_process_quantization: bool = False
+
+
+@dataclass
+class PostProcessWeightsReqOutput(BaseReq):
+ success: bool
+ message: str
+
docker/patch/latest/sglang.patch:828
- The conditional logic for checking stream usage has been modified to nest the enable_dual_stream check inside the existing conditional. However, the logic change appears to alter when routed experts are used. The original code would use routed experts when NOT using dual stream, but the new code requires both enable_dual_stream to be true AND the stream check to pass. This could change the behavior significantly. Verify this logic change is intentional and correct.
output.expert_distribution_metrics = recorder_outputs.get("metrics")
# Copy cached routing experts' buffers back to CPU cache
- get_global_experts_capturer().on_forward_end(
- forward_batch=forward_batch,
- can_run_graph=output.can_run_graph,
- cuda_graph_batch=getattr(self.graph_runner, "bs", None),
- )
+ if not self.is_draft_worker:
+ # In speculative decoding, num_tokens_per_bs > 1, so we need to pass
+ # the actual number of tokens per dp rank in cuda graph, not batch size.
+ cuda_graph_num_tokens = None
+ if getattr(self.graph_runner, "bs", None):
+ cuda_graph_num_tokens = (
+ self.graph_runner.bs * self.graph_runner.num_tokens_per_bs
+ )
+ get_global_experts_capturer().on_forward_end(
+ # elif self.enable_memory_saver:
+ #     logger.warning(
+ #         "Memory saver is enabled, which is not compatible with TransferEngine. Does not support using TransferEngine as remote instance weight loader backend."
+ #     )
+ #     return False
Copilot AI (Jan 13, 2026)
The commented-out code that checks for memory saver compatibility with TransferEngine represents a significant behavioral change. Removing this safety check without proper justification could lead to runtime issues. The original check prevented using TransferEngine when memory saver is enabled due to incompatibility. If this restriction is being removed, there should be clear evidence that the incompatibility has been resolved, or documentation explaining why this check is no longer necessary.
Suggested change:
Original:
+ # elif self.enable_memory_saver:
+ #     logger.warning(
+ #         "Memory saver is enabled, which is not compatible with TransferEngine. Does not support using TransferEngine as remote instance weight loader backend."
+ #     )
+ #     return False
Suggested:
+ elif self.enable_memory_saver:
+     logger.warning(
+         "Memory saver is enabled, which is not compatible with TransferEngine. Does not support using TransferEngine as remote instance weight loader backend."
+     )
+     return False
+ try:
+     self.remote_instance_transfer_engine.unregister_memory(addr)
+ except Exception as e:
+     logger.debug(f"Failed to unregister memory at {addr}: {e}")
Copilot AI (Jan 13, 2026)
The error handling in the unregister_memory method catches all exceptions but only logs them at debug level. If unregistering memory fails, this could lead to memory leaks or resource management issues. Consider logging at warning or error level instead of debug, and potentially tracking whether unregistration succeeded to handle cleanup failures appropriately.
Suggested change:
Original:
+ logger.debug(f"Failed to unregister memory at {addr}: {e}")
Suggested:
+ logger.warning(f"Failed to unregister memory at {addr}: {e}")
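Beyond raising the log level, the comment also suggests tracking whether unregistration succeeded. A minimal sketch of that idea as a standalone helper; the function name unregister_blocks and the returned failure list are illustrative assumptions, while unregister_memory(addr) and the (addr, size) block tuples mirror the diff above.

```python
import logging

logger = logging.getLogger(__name__)


def unregister_blocks(transfer_engine, registered_blocks):
    """Unregister memory blocks and report which addresses failed.

    transfer_engine is assumed to expose unregister_memory(addr) as in the
    diff above; registered_blocks is a list of (addr, size) tuples.
    """
    failed_addrs = []
    for addr, _size in registered_blocks:
        try:
            transfer_engine.unregister_memory(addr)
        except Exception as e:
            logger.warning(f"Failed to unregister memory at {addr}: {e}")
            failed_addrs.append(addr)
    if failed_addrs:
        logger.warning(
            f"{len(failed_addrs)} memory region(s) could not be unregistered"
        )
    return failed_addrs
```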
+ if self.remote_instance_transfer_engine is None:
+     return
+
+ logger.debug("Unregistering old memory regions from transfer engine")
Copilot AI (Jan 13, 2026)
The logging message uses "Unregistering old memory regions" which could be misleading. This method is called during memory saver pause operations, not just for "old" regions. Consider making the log message more specific, such as "Unregistering memory regions during memory saver pause" to clarify the context.
Suggested change:
Original:
+ logger.debug("Unregistering old memory regions from transfer engine")
Suggested:
+ logger.debug("Unregistering memory regions from transfer engine during memory saver pause")
+     self.model, self.remote_instance_transfer_engine
+ )
+
+ def remote_instance_unregister_memory_region(self):
Copilot AI (Jan 13, 2026)
The method name "remote_instance_unregister_memory_region" uses singular "region" while the method actually handles multiple regions. For consistency with the method's functionality and the logging message within it ("regions" plural), consider renaming to "remote_instance_unregister_memory_regions".
Suggested change:
Original:
+ def remote_instance_unregister_memory_region(self):
Suggested:
+ def remote_instance_unregister_memory_regions(self):
+ for name, (data_ptr, numel, element_size) in (
+     self.remote_instance_transfer_engine_weight_info.items()
+ ):
+     if data_ptr not in old_addrs:
+         old_addrs.add(data_ptr)
+         registered_blocks.append((data_ptr, numel * element_size))
+
+ for addr, size in registered_blocks:
+     try:
+         self.remote_instance_transfer_engine.unregister_memory(addr)
+     except Exception as e:
+         logger.debug(f"Failed to unregister memory at {addr}: {e}")
+
+ self.remote_instance_transfer_engine_weight_info = None
Copilot AI (Jan 13, 2026)
The method sets the remote_instance_transfer_engine_weight_info to None after unregistering memory regions, but there's no check to ensure the dictionary exists before iteration. If this method is called multiple times or in an unexpected order, it could raise an AttributeError. Consider adding a check for None or using hasattr before accessing the dictionary.
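A minimal sketch of the guard this comment asks for, folded together with the plural method name and warning-level logging suggested in the earlier comments. The body mirrors the diff above; the getattr-based None check is the illustrative addition, and logger refers to the module-level logger of the surrounding file.

```python
def remote_instance_unregister_memory_regions(self):
    # Guard: nothing to do if weight info was never registered or was already cleared.
    if getattr(self, "remote_instance_transfer_engine_weight_info", None) is None:
        return

    old_addrs = set()
    registered_blocks = []
    for name, (data_ptr, numel, element_size) in (
        self.remote_instance_transfer_engine_weight_info.items()
    ):
        if data_ptr not in old_addrs:
            old_addrs.add(data_ptr)
            registered_blocks.append((data_ptr, numel * element_size))

    for addr, size in registered_blocks:
        try:
            self.remote_instance_transfer_engine.unregister_memory(addr)
        except Exception as e:
            logger.warning(f"Failed to unregister memory at {addr}: {e}")

    # Clear the cached weight info so a second call hits the guard above.
    self.remote_instance_transfer_engine_weight_info = None
```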
  torch.distributed.barrier(self.tp_cpu_group)
+ self.tp_worker.model_runner.remote_instance_unregister_memory_region()
  self.memory_saver_adapter.pause(GPU_MEMORY_TYPE_WEIGHTS)

  if GPU_MEMORY_TYPE_CUDA_GRAPH in tags:
@@ -173,10 +192,18 @@ class SchedulerUpdateWeightsMixin:
          self.stashed_model_static_state,
      )
      del self.stashed_model_static_state
+ self.tp_worker.model_runner.remote_instance_register_memory_region()
Copilot AI (Jan 13, 2026)
The barrier synchronization and memory region unregistration/registration calls need to be carefully ordered. The current implementation calls unregister after barrier and register after barrier, but if one process fails to unregister, it could lead to inconsistent state across processes. Consider adding error handling or verification that all processes successfully complete the memory operations before proceeding.
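One possible shape for that verification, as a sketch only: add a second barrier so that no rank pauses weights until every rank has finished unregistering. The names come from the diff above; the extra barrier is an assumption about how the ordering could be enforced, not the PR's implementation.

```python
# Fragment of the scheduler method shown in the diff above (illustrative).
torch.distributed.barrier(self.tp_cpu_group)
self.tp_worker.model_runner.remote_instance_unregister_memory_region()
# Second barrier (assumed): ensure all ranks have unregistered before any rank
# pauses weights, so failures surface before the memory saver changes state.
torch.distributed.barrier(self.tp_cpu_group)
self.memory_saver_adapter.pause(GPU_MEMORY_TYPE_WEIGHTS)
```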
@@ -1,5 +1,5 @@
 diff --git a/python/sglang/srt/disaggregation/decode.py b/python/sglang/srt/disaggregation/decode.py
-index 199885244..742ad0639 100644
+index 1998852..742ad06 100644
Copilot AI (Jan 13, 2026)
The git index hashes have been changed from 9 characters to 7 characters throughout the patch file (e.g., "199885244..742ad0639" changed to "1998852..742ad06"). While 7-character short hashes are typically sufficient, this inconsistency in hash length could cause issues if tools expect a specific format. Ensure that all tools consuming this patch file can handle the shorter hash format.
Suggested change:
Original:
index 1998852..742ad06 100644
Suggested:
index 199885244..742ad0639 100644
This is the sglang patch for the images CI, for the following PR:
#1311