[1/2] feat/add_fault_torlance #1404
base: main
Conversation
Pull request overview
This pull request adds release and resume functionality for memory regions (MR) to the sglang codebase, enabling better memory management during model weight updates and memory saver operations. This is part 1 of a 2-part change series.
Changes:
- Adds memory region registration and unregistration methods to the ModelRunner for remote instance transfer engine
- Introduces ReleaseMemoryOccupationReqInput and ResumeMemoryOccupationReqInput request types with corresponding handler methods
- Integrates memory region lifecycle management with memory saver pause/resume operations and removes the check that blocked using TransferEngine together with memory saver
Comments suppressed due to low confidence (2)
docker/patch/latest/sglang.patch:528
- The ReleaseMemoryOccupationReqInput and ResumeMemoryOccupationReqInput dataclasses define a 'tag' field but don't provide any documentation about what valid values are expected or what different tags mean. Consider adding docstrings to these classes explaining the purpose and valid values for the tag field.
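A minimal sketch of what such docstrings could look like, assuming the tag field accepts the GPU_MEMORY_TYPE_* constants that appear elsewhere in this patch (e.g. GPU_MEMORY_TYPE_WEIGHTS, GPU_MEMORY_TYPE_CUDA_GRAPH); the exact set of valid values is an assumption, and the BaseReq stub stands in for sglang's request base class.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BaseReq:
    """Stand-in for sglang's request base class (for illustration only)."""


@dataclass
class ReleaseMemoryOccupationReqInput(BaseReq):
    """Ask the scheduler to release GPU memory held by one resource type.

    tag: which memory pool to release, e.g. GPU_MEMORY_TYPE_WEIGHTS or
    GPU_MEMORY_TYPE_CUDA_GRAPH (assumed values); None could mean "all pools".
    """

    tag: Optional[str] = None


@dataclass
class ResumeMemoryOccupationReqInput(BaseReq):
    """Ask the scheduler to re-occupy memory released earlier with the same tag."""

    tag: Optional[str] = None
```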
@@ -1286,6 +1286,19 @@ class UpdateWeightsFromIPCReqOutput(BaseReq):
success: bool
message: str
+@dataclass
+class PostProcessWeightsReqInput(BaseReq):
+ # Whether to restore weights before loading new weights
+ restore_weights_before_load: bool = False
+ # Whether to enable quantization post-processing
+ post_process_quantization: bool = False
+
+
+@dataclass
+class PostProcessWeightsReqOutput(BaseReq):
+ success: bool
+ message: str
+
docker/patch/latest/sglang.patch:828
- The conditional logic for checking stream usage has been modified to nest the enable_dual_stream check inside the existing conditional. However, the logic change appears to alter when routed experts are used. The original code would use routed experts when NOT using dual stream, but the new code requires both enable_dual_stream to be true AND the stream check to pass. This could change the behavior significantly. Verify this logic change is intentional and correct.
output.expert_distribution_metrics = recorder_outputs.get("metrics")
# Copy cached routing experts' buffers back to CPU cache
- get_global_experts_capturer().on_forward_end(
- forward_batch=forward_batch,
- can_run_graph=output.can_run_graph,
- cuda_graph_batch=getattr(self.graph_runner, "bs", None),
- )
+ if not self.is_draft_worker:
+ # In speculative decoding, num_tokens_per_bs > 1, so we need to pass
+ # the actual number of tokens per dp rank in cuda graph, not batch size.
+ cuda_graph_num_tokens = None
+ if getattr(self.graph_runner, "bs", None):
+ cuda_graph_num_tokens = (
+ self.graph_runner.bs * self.graph_runner.num_tokens_per_bs
+ )
+ get_global_experts_capturer().on_forward_end(
+ # elif self.enable_memory_saver:
+ #     logger.warning(
+ #         "Memory saver is enabled, which is not compatible with TransferEngine. Does not support using TransferEngine as remote instance weight loader backend."
+ #     )
+ #     return False
Copilot AI (Jan 13, 2026)
The commented-out code that checks for memory saver compatibility with TransferEngine represents a significant behavioral change. Removing this safety check without proper justification could lead to runtime issues. The original check prevented using TransferEngine when memory saver is enabled due to incompatibility. If this restriction is being removed, there should be clear evidence that the incompatibility has been resolved, or documentation explaining why this check is no longer necessary.
Suggested change:
Original:
+ # elif self.enable_memory_saver:
+ #     logger.warning(
+ #         "Memory saver is enabled, which is not compatible with TransferEngine. Does not support using TransferEngine as remote instance weight loader backend."
+ #     )
+ #     return False
Suggested:
+ elif self.enable_memory_saver:
+     logger.warning(
+         "Memory saver is enabled, which is not compatible with TransferEngine. Does not support using TransferEngine as remote instance weight loader backend."
+     )
+     return False
+ try:
+     self.remote_instance_transfer_engine.unregister_memory(addr)
+ except Exception as e:
+     logger.debug(f"Failed to unregister memory at {addr}: {e}")
Copilot AI (Jan 13, 2026)
The error handling in the unregister_memory method catches all exceptions but only logs them at debug level. If unregistering memory fails, this could lead to memory leaks or resource management issues. Consider logging at warning or error level instead of debug, and potentially tracking whether unregistration succeeded to handle cleanup failures appropriately.
Suggested change:
Original:
+ logger.debug(f"Failed to unregister memory at {addr}: {e}")
Suggested:
+ logger.warning(f"Failed to unregister memory at {addr}: {e}")
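Beyond raising the log level, the comment also suggests tracking whether unregistration succeeded. A minimal sketch of that idea as a standalone helper; the function name unregister_blocks and the returned failure list are illustrative assumptions, while unregister_memory(addr) and the (addr, size) block tuples mirror the diff above.

```python
import logging

logger = logging.getLogger(__name__)


def unregister_blocks(transfer_engine, registered_blocks):
    """Unregister memory blocks and report which addresses failed.

    transfer_engine is assumed to expose unregister_memory(addr) as in the
    diff above; registered_blocks is a list of (addr, size) tuples.
    """
    failed_addrs = []
    for addr, _size in registered_blocks:
        try:
            transfer_engine.unregister_memory(addr)
        except Exception as e:
            logger.warning(f"Failed to unregister memory at {addr}: {e}")
            failed_addrs.append(addr)
    if failed_addrs:
        logger.warning(
            f"{len(failed_addrs)} memory region(s) could not be unregistered"
        )
    return failed_addrs
```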
+ if self.remote_instance_transfer_engine is None:
+     return
+
+ logger.debug("Unregistering old memory regions from transfer engine")
Copilot AI (Jan 13, 2026)
The logging message uses "Unregistering old memory regions" which could be misleading. This method is called during memory saver pause operations, not just for "old" regions. Consider making the log message more specific, such as "Unregistering memory regions during memory saver pause" to clarify the context.
Suggested change:
Original:
+ logger.debug("Unregistering old memory regions from transfer engine")
Suggested:
+ logger.debug("Unregistering memory regions from transfer engine during memory saver pause")
+     self.model, self.remote_instance_transfer_engine
+ )
+
+ def remote_instance_unregister_memory_region(self):
Copilot AI (Jan 13, 2026)
The method name "remote_instance_unregister_memory_region" uses singular "region" while the method actually handles multiple regions. For consistency with the method's functionality and the logging message within it ("regions" plural), consider renaming to "remote_instance_unregister_memory_regions".
Suggested change:
Original:
+ def remote_instance_unregister_memory_region(self):
Suggested:
+ def remote_instance_unregister_memory_regions(self):
+ for name, (data_ptr, numel, element_size) in (
+     self.remote_instance_transfer_engine_weight_info.items()
+ ):
+     if data_ptr not in old_addrs:
+         old_addrs.add(data_ptr)
+         registered_blocks.append((data_ptr, numel * element_size))
+
+ for addr, size in registered_blocks:
+     try:
+         self.remote_instance_transfer_engine.unregister_memory(addr)
+     except Exception as e:
+         logger.debug(f"Failed to unregister memory at {addr}: {e}")
+
+ self.remote_instance_transfer_engine_weight_info = None
Copilot AI (Jan 13, 2026)
The method sets the remote_instance_transfer_engine_weight_info to None after unregistering memory regions, but there's no check to ensure the dictionary exists before iteration. If this method is called multiple times or in an unexpected order, it could raise an AttributeError. Consider adding a check for None or using hasattr before accessing the dictionary.
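A minimal sketch of the guard this comment asks for, folded together with the plural method name and warning-level logging suggested in the earlier comments. The body mirrors the diff above; the getattr-based None check is the illustrative addition, and logger refers to the module-level logger of the surrounding file.

```python
def remote_instance_unregister_memory_regions(self):
    # Guard: nothing to do if weight info was never registered or was already cleared.
    if getattr(self, "remote_instance_transfer_engine_weight_info", None) is None:
        return

    old_addrs = set()
    registered_blocks = []
    for name, (data_ptr, numel, element_size) in (
        self.remote_instance_transfer_engine_weight_info.items()
    ):
        if data_ptr not in old_addrs:
            old_addrs.add(data_ptr)
            registered_blocks.append((data_ptr, numel * element_size))

    for addr, size in registered_blocks:
        try:
            self.remote_instance_transfer_engine.unregister_memory(addr)
        except Exception as e:
            logger.warning(f"Failed to unregister memory at {addr}: {e}")

    # Clear the cached weight info so a second call hits the guard above.
    self.remote_instance_transfer_engine_weight_info = None
```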
  torch.distributed.barrier(self.tp_cpu_group)
+ self.tp_worker.model_runner.remote_instance_unregister_memory_region()
  self.memory_saver_adapter.pause(GPU_MEMORY_TYPE_WEIGHTS)

  if GPU_MEMORY_TYPE_CUDA_GRAPH in tags:
@@ -173,10 +192,18 @@ class SchedulerUpdateWeightsMixin:
          self.stashed_model_static_state,
      )
      del self.stashed_model_static_state
+ self.tp_worker.model_runner.remote_instance_register_memory_region()
Copilot AI (Jan 13, 2026)
The barrier synchronization and memory region unregistration/registration calls need to be carefully ordered. The current implementation calls unregister after barrier and register after barrier, but if one process fails to unregister, it could lead to inconsistent state across processes. Consider adding error handling or verification that all processes successfully complete the memory operations before proceeding.
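One possible shape for that verification, as a sketch only: add a second barrier so that no rank pauses weights until every rank has finished unregistering. The names come from the diff above; the extra barrier is an assumption about how the ordering could be enforced, not the PR's implementation.

```python
# Fragment of the scheduler method shown in the diff above (illustrative).
torch.distributed.barrier(self.tp_cpu_group)
self.tp_worker.model_runner.remote_instance_unregister_memory_region()
# Second barrier (assumed): ensure all ranks have unregistered before any rank
# pauses weights, so failures surface before the memory saver changes state.
torch.distributed.barrier(self.tp_cpu_group)
self.memory_saver_adapter.pause(GPU_MEMORY_TYPE_WEIGHTS)
```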
@@ -1,5 +1,5 @@
 diff --git a/python/sglang/srt/disaggregation/decode.py b/python/sglang/srt/disaggregation/decode.py
-index 199885244..742ad0639 100644
+index 1998852..742ad06 100644
Copilot AI (Jan 13, 2026)
The git index hashes have been changed from 9 characters to 7 characters throughout the patch file (e.g., "199885244..742ad0639" changed to "1998852..742ad06"). While 7-character short hashes are typically sufficient, this inconsistency in hash length could cause issues if tools expect a specific format. Ensure that all tools consuming this patch file can handle the shorter hash format.
Suggested change:
Original:
index 1998852..742ad06 100644
Suggested:
index 199885244..742ad0639 100644
This is the sglang patch for the images CI, for the following PR:
#1311