feat(xpu): add Intel GPU (XPU) support#1110
Conversation
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR adds Intel XPU (PyTorch XPU) support across runtime device detection, Python model components, C++/pybind ops, and Bazel build/packaging so the project can run on Intel GPUs with appropriate fallbacks.
Changes:
- Introduce XPU device type and GPU-agnostic helpers (availability/count/device selection/visible devices).
- Add XPU-specific Python modules (attention SDPA + vLLM kernels wrapper, norms/activations/linear strategies) and KV-cache layout handling.
- Extend C++ runtime/ops and Bazel toolchain + wheel metadata to support
--config=xpubuilds (Python 3.12, SYCL toolchain, XPU bindings).
Reviewed changes
Copilot reviewed 92 out of 96 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| rtp_llm/start_backend_server.py | Switch GPU detection to device-agnostic helpers and add VIT separation server path. |
| rtp_llm/ops/init.py | Add XPU detection logging and make libpython preload version-agnostic. |
| rtp_llm/models_py/utils/arch.py | Extend device-type utilities import to include XPU helpers. |
| rtp_llm/models_py/standalone/auto_model.py | Select xpu device when available; adjust KV layout and pin_memory behavior for XPU. |
| rtp_llm/models_py/modules/hybrid/causal_attention.py | Add XPU norm import for hybrid causal attention. |
| rtp_llm/models_py/modules/factory/linear/impl/xpu/f16_linear.py | Add XPU F16/BF16 Linear backend using PyTorch F.linear. |
| rtp_llm/models_py/modules/factory/linear/impl/xpu/init.py | Register XPU Linear strategies in the factory. |
| rtp_llm/models_py/modules/factory/linear/init.py | Route Linear factory registration to XPU strategies when on XPU. |
| rtp_llm/models_py/modules/factory/fused_moe/impl/xpu/init.py | Add XPU MoE placeholder module. |
| rtp_llm/models_py/modules/factory/fused_moe/init.py | Configure MoE registry for XPU to use batched Triton fallback. |
| rtp_llm/models_py/modules/factory/attention/xpu_impl/test/test_kv_cache_layout.py | Add CPU-runnable test guarding XPU KV cache NSHD layout contract. |
| rtp_llm/models_py/modules/factory/attention/xpu_impl/test/BUILD | Bazel target for KV cache layout test. |
| rtp_llm/models_py/modules/factory/attention/xpu_impl/sdpa.py | Add XPU SDPA attention implementations for prefill/decode with RoPE + paged cache. |
| rtp_llm/models_py/modules/factory/attention/xpu_impl/init.py | Add XPU attention package marker. |
| rtp_llm/models_py/modules/factory/attention/init.py | Register XPU attention implementations in the attention factory lists. |
| rtp_llm/models_py/modules/base/xpu/vllm_xpu_ops.py | Add wrapper for optional vllm-xpu-kernels ops with PyTorch fallbacks. |
| rtp_llm/models_py/modules/base/xpu/not_implemented_ops.py | Add XPU stubs for unsupported ops. |
| rtp_llm/models_py/modules/base/xpu/norm.py | Add XPU norm implementations with optional vllm-xpu-kernels acceleration. |
| rtp_llm/models_py/modules/base/xpu/moe_gating.py | Add PyTorch fallback MoE gating op for XPU. |
| rtp_llm/models_py/modules/base/xpu/activation.py | Add XPU fused SiLU-and-mul implementation with optional kernel acceleration. |
| rtp_llm/models_py/modules/base/common/embedding.py | Add fallback path when compiled embedding op is unavailable. |
| rtp_llm/models_py/modules/base/init.py | Wire base module imports for XPU device type. |
| rtp_llm/models_py/bindings/xpu/XpuTorchExt.h | Add XPU-specific torch extension header. |
| rtp_llm/models_py/bindings/xpu/RegisterXpuOps.cc | Register XPU pybind ops entry point. |
| rtp_llm/models_py/bindings/xpu/RegisterXpuBaseBindings.hpp | Provide XPU fallback implementations for key kernels in bindings. |
| rtp_llm/models_py/bindings/xpu/BUILD | Bazel target building XPU bindings. |
| rtp_llm/models_py/bindings/core/ExecOps.h | Add getTorchDevice() API and keep CUDA alias for compatibility. |
| rtp_llm/models_py/bindings/core/ExecOps.cc | Extend runtime sync/event/device/memory queries to XPU. |
| rtp_llm/models_py/bindings/core/CudaSampleOp.cc | Add XPU pure-PyTorch sampling implementation and disable speculative sampling. |
| rtp_llm/models_py/bindings/core/CudaOps.cc | Add XPU implementations for copy and logits masking operations. |
| rtp_llm/models_py/bindings/core/CudaBeamSearchOp.cc | Add PyTorch fallback beam search for XPU. |
| rtp_llm/models_py/bindings/core/BUILD | Update core bindings build graph for XPU selects and SYCL feature flags. |
| rtp_llm/models_py/bindings/common/kernels/BUILD | Disable CUDA-only fuse-copy kernel compilation on XPU. |
| rtp_llm/models_py/bindings/common/FusedCopyOp.cc | Add XPU fallback for fused copy ops using SYCL queue memcpy. |
| rtp_llm/models_py/bindings/common/BUILD | Adjust common bindings build to include/exclude CUDA-only sources on XPU. |
| rtp_llm/models_py/bindings/OpDefs.h | Add XPU KV cache NSHD layout and add position_ids field to attention inputs. |
| rtp_llm/models_py/bindings/OpDefs.cc | Expose new position_ids binding and make decode_cu_seqlens_host read-only. |
| rtp_llm/models/base_model.py | Prefer XPU device string when available. |
| rtp_llm/model_loader/weight_manager.py | Disable CUDA stream usage on XPU and adjust synchronization paths. |
| rtp_llm/model_loader/loader.py | Extend memory cleanup helper to XPU. |
| rtp_llm/frontend/frontend_app.py | Add uvicorn import fallback for loop auto setup. |
| rtp_llm/device/device_type.py | Add XPU device type and detection helper is_xpu(). |
| rtp_llm/device/device_impl.py | Add XPU device implementation and GPU-agnostic helper APIs. |
| rtp_llm/device/init.py | Register XPU device class in device factory. |
| rtp_llm/cpp/utils/TensorDebugUtils.h | Treat XPU tensors like CUDA for debug-dump restrictions. |
| rtp_llm/cpp/utils/ErrorCode.h | Include <string> explicitly (XPU toolchain include differences). |
| rtp_llm/cpp/pybind/th_utils.h | Make CUDA-check macros accept XPU tensors when building for XPU. |
| rtp_llm/cpp/pybind/ComputeInit.cc | Enable exec ctx ops registration for XPU builds. |
| rtp_llm/cpp/pybind/BUILD | Link XPU exec ops and adjust deps for XPU builds. |
| rtp_llm/cpp/normal_engine/speculative/SpeculativeSampler.cc | Use getTorchDevice() and treat XPU like CUDA for host copies. |
| rtp_llm/cpp/normal_engine/speculative/MtpExecutor.cc | Use getTorchDevice() for speculative buffers on XPU-enabled builds. |
| rtp_llm/cpp/normal_engine/speculative/MtpBatchStreamProcessor.cc | Update device transfers and CPU staging for XPU tensors. |
| rtp_llm/cpp/normal_engine/NormalSamplerInputGatherer.cc | Allocate all_probs on getTorchDevice() (CUDA/XPU). |
| rtp_llm/cpp/normal_engine/NormalOutputDispatcher.cc | Move label tensor to getTorchDevice() for loss computation. |
| rtp_llm/cpp/normal_engine/NormalModelInputGatherer.cc | Use getTorchDevice() for multimodal tensors in context batching. |
| rtp_llm/cpp/normal_engine/NormalEngine.cc | Add XPU caching allocator sync/empty-cache and warmup gating. |
| rtp_llm/cpp/models/logits_processor/MultiSeqLogitsProcessor.cc | Move mask to getTorchDevice() for XPU compatibility. |
| rtp_llm/cpp/models/logits_processor/BaseLogitsProcessor.cc | Return vocab mask on getTorchDevice() for XPU compatibility. |
| rtp_llm/cpp/models/eplb/ExpertBalancer.cc | Allocate tensors on getTorchDevice() for XPU compatibility. |
| rtp_llm/cpp/models/Sampler.cc | Switch sampler tensors/transfers to getTorchDevice() and fix variable-beam token copy. |
| rtp_llm/cpp/models/PyWrappedModel.h | Disable CUDA graph/prefill-CP features on XPU; add device sync for XPU. |
| rtp_llm/cpp/models/PyWrappedModel.cc | Generalize host->device tensor staging to getTorchDevice() and treat XPU as device. |
| rtp_llm/cpp/models/ModelTypes.cc | Allocate packed GPU buffers on getTorchDevice() and treat XPU as device. |
| rtp_llm/cpp/models/BUILD | Adjust model library deps for XPU builds (no CUDA graph impl; keep copy op). |
| rtp_llm/cpp/engine_base/stream/GenerateStream.cc | Add XPU generator support; treat XPU like CUDA for CPU staging. |
| rtp_llm/cpp/engine_base/WeightsConverter.cc | Copy tensors to getTorchDevice() for XPU compatibility. |
| rtp_llm/cpp/engine_base/TorchProfiler.h | Enable XPU profiler activity type when building for XPU. |
| rtp_llm/cpp/cache/connector/p2p/transfer/tcp/CudaCopyUtil.cc | Use getTorchDevice() for wrapped raw pointers in copies. |
| rtp_llm/cpp/cache/connector/p2p/LayerBlockConverterImpl.h | Treat XPU like CUDA in BlockInfo device classification. |
| rtp_llm/cpp/cache/connector/memory/KVCacheMemoryConnector.cc | Use getTorchDevice() for mem/gpu block tensor wrappers. |
| rtp_llm/cpp/cache/MemoryLayoutStrategy.cc | Treat XPU device tensors as GPU blocks. |
| rtp_llm/cpp/cache/MemoryEvaluationHelper.cc | Add XPU free/total memory query path. |
| rtp_llm/cpp/cache/KVCacheManager.cc | Treat XPU tensors as device sources/dests in KV updates. |
| rtp_llm/cpp/cache/BlockPool.cc | Allocate device-side block pool on getTorchDevice(); treat XPU as GPU. |
| rtp_llm/config/server_config_setup.py | Extend local world size/device setup to XPU and add fail-fast for XPU speculative decoding. |
| rtp_llm/BUILD | Add XPU-aware wheel requirements filtering and cp312 wheel tag target. |
| deps/requirements_xpu.txt | Add standalone requirements list for XPU environment (Python 3.12, XPU torch index). |
| deps/pip.bzl | Add pip_parse for XPU lockfile and XPU extra-index URL. |
| deps/BUILD | Add target to compile XPU lockfile. |
| bazel/device_defs.bzl | Add XPU test env selection. |
| bazel/defs.bzl | Allow wheel renaming with configurable Python tag (cp312 for XPU). |
| arch_config/arch_select.bzl | Add XPU dependency selection, wheel req filtering/remap/overrides, and torch deps for XPU. |
| WORKSPACE | Add XPU configure rules and torch_xpu repository; load XPU pip deps. |
| BUILD.pytorch | Add using_xpu config and link XPU runtime libraries + python headers for XPU. |
| BUILD | Add using_xpu config_setting. |
| 3rdparty/gpus/xpu_python_utils.bzl | Add helper to resolve symlinked python inside venvs for repo rules. |
| 3rdparty/gpus/xpu_configure.bzl | Add Intel oneAPI/SYCL toolchain auto-configuration and Python 3.12 validation for XPU builds. |
| 3rdparty/gpus/xpu/BUILD.tpl | Add template build targets for SYCL runtime + Level Zero loader. |
| 3rdparty/gpus/torch_xpu_configure.bzl | Add repository rule to locate system-installed PyTorch XPU site-packages. |
| 3rdparty/gpus/crosstool/xpu_cc_toolchain_config.bzl.tpl | Add cc_toolchain_config for SYCL compilation/linking flags. |
| 3rdparty/gpus/crosstool/clang/bin/crosstool_wrapper_driver_xpu.tpl | Add crosstool wrapper routing Bazel C/C++ to icx/icpx with flag filtering. |
| .bazelrc | Add --config=xpu build/test settings for SYCL toolchain, env vars, and Python path. |
Comments suppressed due to low confidence (7)
rtp_llm/models_py/bindings/core/CudaSampleOp.cc:1
- XPU sampling reinterprets
top_k(int32) asuint32_t, which breaks semantics for disabled values (e.g.,top_k <= 0). A negativetop_kbecomes a hugeuint32_t, causinghas_top_k/kcomputation to behave incorrectly and potentially calltopk()with unintendedk. Use anint32_t*(orint64_t) view fortop_kchecks/clamping, and avoidreinterpret_cast<uint32_t*>here.
rtp_llm/models_py/bindings/core/CudaSampleOp.cc:1 - XPU sampling reinterprets
top_k(int32) asuint32_t, which breaks semantics for disabled values (e.g.,top_k <= 0). A negativetop_kbecomes a hugeuint32_t, causinghas_top_k/kcomputation to behave incorrectly and potentially calltopk()with unintendedk. Use anint32_t*(orint64_t) view fortop_kchecks/clamping, and avoidreinterpret_cast<uint32_t*>here.
rtp_llm/models_py/bindings/core/CudaSampleOp.cc:1 - XPU sampling reinterprets
top_k(int32) asuint32_t, which breaks semantics for disabled values (e.g.,top_k <= 0). A negativetop_kbecomes a hugeuint32_t, causinghas_top_k/kcomputation to behave incorrectly and potentially calltopk()with unintendedk. Use anint32_t*(orint64_t) view fortop_kchecks/clamping, and avoidreinterpret_cast<uint32_t*>here.
rtp_llm/models_py/modules/factory/attention/xpu_impl/sdpa.py:1 - There are repeated
.cpu()conversions inside request loops (block_ids_all[req_idx].cpu(),block_ids_all[0].cpu(),block_ids_all[i].cpu()), which can introduce per-iteration overhead and synchronization. Moveblock_ids_allto a CPU tensor once (if needed) before the loop, then index it without further device transfers; likewise, only computebidson CPU once per forward path.
rtp_llm/models_py/modules/factory/attention/xpu_impl/sdpa.py:1 - There are repeated
.cpu()conversions inside request loops (block_ids_all[req_idx].cpu(),block_ids_all[0].cpu(),block_ids_all[i].cpu()), which can introduce per-iteration overhead and synchronization. Moveblock_ids_allto a CPU tensor once (if needed) before the loop, then index it without further device transfers; likewise, only computebidson CPU once per forward path.
rtp_llm/models_py/modules/base/xpu/vllm_xpu_ops.py:1 - Inserting an arbitrary environment-controlled path at the front of
sys.pathcan enable unintended module shadowing/import hijacking. Prefer loading the extension via a controlled mechanism (e.g., validating the path is absolute/expected, warning when enabled, or usingimportlibwith a targeted loader) rather than globally modifying import precedence.
rtp_llm/models_py/modules/base/common/embedding.py:1 - When the compiled
rtp_llm_ops.embeddingis unavailable, the fallback path silently ignorestext_tokens_mask(multimodal masking) and proceeds, which can produce incorrect model outputs. A warning-once is easy to miss in production; consider failing fast whentext_tokens_maskis provided (or implementing mask support in the fallback) to avoid silently wrong results.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
AI Code Review - PR #1110Status: BLOCKING Summary: P0/0 · P1/7 · P2/15 · P3/0 Blocking IssuesP1
Non-blocking SuggestionsP2
Checklist ✅ (56 items passed)Strengths
|
3060828 to
e9f12a7
Compare
AI Code Review - PR #1110Status: BLOCKING Summary: P0/0 · P1/3 · P2/14 · P3/0 Blocking IssuesP1
Non-blocking SuggestionsP2
Checklist ✅ (56 items passed)Strengths
|
e9f12a7 to
1ba8d0a
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 93 out of 97 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (5)
rtp_llm/device/device_impl.py:1
_is_xpu_device()ignores the newRTP_LLM_DEVICE_TYPEoverride logic inget_device_type()and will return true whenevertorch.xpu.is_available()is true, even if the user explicitly forced CUDA. This can route server setup and device selection down the XPU path unexpectedly. Prefer basing this onget_device_type() == DeviceType.Xpu/is_xpu()(or passing an already-resolved DeviceType) so override + detection are consistent everywhere.
rtp_llm/models_py/modules/base/common/embedding.py:1- This fallback silently ignores
text_tokens_mask(only a one-time warning), which can produce incorrect multimodal outputs. This is also inconsistent with the XPU C++ binding added inRegisterXpuBaseBindings.hpp, which hard-fails whentext_tokens_maskis provided. To avoid silent correctness issues, make the Python fallback reject non-emptytext_tokens_mask(raise), or implement equivalent masking behavior in the fallback so semantics match the fused op.
rtp_llm/models_py/modules/factory/attention/init.py:1 - This change drops the previous ordering logic that explicitly kept
XQAImplhigher-priority to avoid token divergence and golden refreshes (per the removed comment). Ifget_xqa_impl()returns a different implementation, decode behavior and numerics can change compared to prior releases. Consider restoring the old behavior: appendXQAImplfirst, then appendget_xqa_impl()only when it differs, so the default remains stable unless explicitly changed.
rtp_llm/frontend/frontend_app.py:1 auto_loop_factoryis not a drop-in replacement forauto_loop_setupin uvicorn; it typically returns a loop implementation rather than performing setup. If the rest of this module expectsauto_loop_setup(...)side effects, this fallback can break event-loop initialization at runtime. Prefer defining a small compatibility wrapper that preserves the expected call semantics (e.g., call the factory and then apply the result to the uvicorn config), or gate on uvicorn version with the correct API for each.
rtp_llm/models_py/bindings/core/CudaSampleOp.cc:1- Reinterpreting an
int32_t*buffer asuint32_t*can violate C++ strict-aliasing rules and is undefined behavior in optimized builds. Use the nativeint32_t*pointer and assign1directly (or perform a safe cast per element) to avoid UB while keeping the same comparison/branch behavior.
| elif normalized in _XPU_PACKAGE_REMAP: | ||
| xpu_reqs.append(_XPU_PACKAGE_REMAP[normalized]) | ||
| else: | ||
| xpu_reqs.append(req) |
1ba8d0a to
32d1d4e
Compare
AI Code Review - PR #1110Status: BLOCKING Summary: P0/0 · P1/2 · P2/21 · P3/0 Blocking IssuesP1
Non-blocking SuggestionsP2
Checklist ✅ (56 items passed)Strengths
|
32d1d4e to
f79e7cb
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 93 out of 97 changed files in this pull request and generated no new comments.
Comments suppressed due to low confidence (4)
rtp_llm/device/device_impl.py:1
gpu_is_available()currently returns true based on the selected device type (includingRTP_LLM_DEVICE_TYPEoverride), not on actual runtime availability. If a user forcesRTP_LLM_DEVICE_TYPE=xpuon a build withouttorch.xpu(or forcescudawhen CUDA isn’t available),gpu_device_count()will raise (torch.xpumissing) or return 0 and downstream code can hit division-by-zero / invalid world-size checks. Fix by gating onhasattr(torch, 'xpu') and torch.xpu.is_available()/torch.cuda.is_available()insidegpu_is_available()andgpu_device_count()(and ideally raise a clear error when an override requests an unavailable backend).
rtp_llm/device/device_impl.py:1- Parsing
ZE_AFFINITY_MASKentries withint(visible[local_rank])will fail for valid Level Zero affinity formats like0.0/0.1(device.tile). IfZE_AFFINITY_MASKcontains tile-qualified entries, this code will throw and prevent startup. Consider parsing by splitting on'.'(taking the device portion) or otherwise supporting tile notation explicitly, and document the expected format.
rtp_llm/models_py/modules/base/common/embedding.py:1 - When
text_tokens_maskis provided, silently ignoring it produces incorrect embeddings for multimodal masked inputs. A warning is easy to miss and turns a correctness requirement into best-effort behavior. Prefer raising a clear exception whentext_tokens_maskis non-empty and the fusedrtp_llm_ops.embeddingop is unavailable, so masked multimodal runs fail fast instead of returning wrong outputs.
rtp_llm/start_backend_server.py:1 _get_cuda_device_list()now returns a generic GPU/XPU-visible list (viaget_visible_device_list()), so the function name is misleading and increases confusion in XPU paths (especially where it later feedsZE_AFFINITY_MASK). Renaming it (e.g.,_get_visible_gpu_device_list) and updating the corresponding local variable names (e.g.,cuda_device_list) would reduce backend-specific ambiguity.
AI Code Review - PR #1110Status: BLOCKING Summary: P0/0 · P1/7 · P2/14 · P3/0 Blocking IssuesP1
Non-blocking SuggestionsP2
Checklist ✅ (56 items passed)Strengths
|
- embedding.py: fail-fast when text_tokens_mask/position_ids/token_types present but rtp_llm_ops.embedding unavailable (no silent wrong output) - server_config_setup.py: route set_device through gpu_set_device() to honor RTP_LLM_DEVICE_TYPE override on mixed XPU+CUDA hosts - base_model.py: use get_device_string() for resolved device type - weight_manager.py: use _is_xpu_device() instead of raw torch.xpu.is_available() for stream/sync selection - start_backend_server.py: hoist XPU world_size>1 fail-fast above device_count branch (covers single-card case); add gpu_device_count()==0 guard against div-by-zero on invalid RTP_LLM_DEVICE_TYPE override - vllm_flash_attn.py: replace weak sum+last prefill cache key with full content digest; add block-table fingerprint to decode cache keys (_write_idx, _flat_bids, _seqused_k) for hybrid-model safety Addresses P1 items #2-#7 from LLLLKKKK's AI code review. Item #1 (per-layer KV gather) is a tracked follow-up: XPU FA2 kernel requires contiguous pages and the interleaved PD layout prevents direct consumption.
- embedding.py: fail-fast when text_tokens_mask/position_ids/token_types present but rtp_llm_ops.embedding unavailable (no silent wrong output) - server_config_setup.py: route set_device through gpu_set_device() to honor RTP_LLM_DEVICE_TYPE override on mixed XPU+CUDA hosts - base_model.py: use get_device_string() for resolved device type - weight_manager.py: use _is_xpu_device() instead of raw torch.xpu.is_available() for stream/sync selection - start_backend_server.py: hoist XPU world_size>1 fail-fast above device_count branch (covers single-card case); add gpu_device_count()==0 guard against div-by-zero on invalid RTP_LLM_DEVICE_TYPE override - vllm_flash_attn.py: replace weak sum+last prefill cache key with full content digest; add block-table fingerprint to decode cache keys (_write_idx, _flat_bids, _seqused_k) for hybrid-model safety Addresses P1 items #2-#7 from LLLLKKKK's AI code review. Item #1 (per-layer KV gather) is a tracked follow-up: XPU FA2 kernel requires contiguous pages and the interleaved PD layout prevents direct consumption.
AI Code Review - PR #1110Status: BLOCKING Summary: P0/0 · P1/2 · P2/18 · P3/1 Blocking IssuesP1
Non-blocking SuggestionsP2
P3
Checklist Violations (6 fail / 56 total)General Principles Checklist
Python Static-First Checklist
Strengths
|
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 93 out of 97 changed files in this pull request and generated no new comments.
Comments suppressed due to low confidence (7)
rtp_llm/models_py/bindings/core/CudaOps.cc:1
c10::xpu::getCurrentXPUStream()is typically an XPUStream object, not asycl::queue&. Treating it as asycl::queuewill likely fail to compile on XPU builds. Use the correct API to obtain the underlying SYCL queue (e.g., get a stream object first and then call.queue()/ equivalent) or use a PyTorch-provided copy primitive for device-to-device copies.
rtp_llm/models_py/bindings/common/FusedCopyOp.cc:1- Same issue as in
CudaOps.cc:getCurrentXPUStream()is unlikely to be asycl::queue&, so this XPU fallback will likely not compile. Retrieve the underlying queue from the XPU stream via the correct accessor, or replace this path withat::Tensor/copy_-based copying that doesn't depend on SYCL types.
rtp_llm/models_py/bindings/core/CudaBeamSearchOp.cc:1 Tensor.where(...)is Python-style and is not a stable/portable libtorch C++ API call; on many PyTorch versions this will not compile (C++ typically usestorch::where(condition, a, b)). Replace with the C++ API form to avoid XPU build failures.
rtp_llm/device/device_impl.py:1gpu_is_available()currently returnsTruebased on the resolved device type (including viaRTP_LLM_DEVICE_TYPEoverride), even if that backend is not actually usable (e.g., override set tocudaon a host withtorch.cuda.device_count()==0). This can send the code down GPU-only paths and fail later. Prefer defining availability in terms ofgpu_device_count() > 0(or validate the override maps to an available backend).
rtp_llm/models/base_model.py:1- This returns
cpu:<rank>when running on CPU, which is not a standard device string for PyTorch tensor placement (CPU typically uses justcpu). Consider special-casing CPU to returncpu(no ordinal) while keepingcuda:<rank>/xpu:<rank>for GPU backends.
rtp_llm/models_py/modules/factory/attention/xpu_impl/sdpa.py:1 - This prefill path performs per-request
.cpu()transfers inside a Python loop (bids = block_ids_all[req_idx].cpu()), which can introduce repeated device↔host sync ifblock_ids_allis device-resident. Since you already prefer host block-id tensors (and SDPA code aims to avoid D2H sync), consider normalizingblock_ids_allto a CPU tensor exactly once (or using the existing host copies consistently) before entering the loop.
rtp_llm/start_backend_server.py:1 _get_cuda_device_list()now returns a device-agnostic GPU visible list (CUDA/ROCm/XPU). The function name is misleading after this change. Renaming it to something like_get_visible_gpu_device_list()(and updating call sites within this file) would make the intent clearer and reduce confusion.
AI Code Review - PR #1110Status: BLOCKING Summary: P0/0 · P1/4 · P2/15 · P3/0 Blocking IssuesP1
Non-blocking SuggestionsP2
Checklist ✅ (56 items passed)Strengths
|
AI Code Review - PR #1110Status: BLOCKING Summary: P0/0 · P1/5 · P2/20 · P3/0 Blocking IssuesP1
Non-blocking SuggestionsP2
Checklist ✅ (56 items passed)Strengths
|
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 94 out of 98 changed files in this pull request and generated no new comments.
Comments suppressed due to low confidence (5)
rtp_llm/start_backend_server.py:1
_get_cuda_device_listno longer returns CUDA-only devices (it now returns a device-type-dependent visible list). This name is misleading and makes call sites harder to reason about; rename it (and related variables likecuda_device_list) to something device-agnostic such as_get_gpu_device_list/gpu_device_list.
rtp_llm/models_py/modules/factory/attention/xpu_impl/sdpa.py:1torch.Tensordoes not consistently expose anis_cpuattribute across PyTorch versions/configurations (whereasis_cudais standard). To avoid attribute errors at runtime, usecu_seqlens.device.type == 'cpu'(or equivalent device-type checks) instead ofcu_seqlens.is_cpuin this module (same applies to otheris_cpuuses here).
rtp_llm/models_py/bindings/core/CudaOps.cc:1c10::xpu::getCurrentXPUStream()is used elsewhere in this PR as a stream object with.synchronize(), but here it's treated as asycl::queue&. This inconsistency is likely a compile-time type error (or at minimum relies on a non-obvious implicit conversion). Prefer extracting the underlying SYCL queue explicitly from the XPU stream (per the API provided by the XPU stream type) and use that formemcpy.
rtp_llm/models_py/bindings/common/FusedCopyOp.cc:1- Same issue as in
CudaOps.cc:c10::xpu::getCurrentXPUStream()is treated as asycl::queue&, which likely does not match the actual return type (and is inconsistent with.synchronize()usage elsewhere). Use the proper API to obtain the underlying SYCL queue from the XPU stream before callingmemcpy.
rtp_llm/models_py/modules/factory/attention/xpu_impl/sdpa.py:1 - The new XPU SDPA impl introduces non-trivial gating and behavioral constraints (e.g., rejecting prefix-cache hits, rejecting non-BASE KV cache dtype, rejecting unsupported RoPE styles). The PR adds XPU attention helper tests, but there are no unit tests shown covering these
support()decisions. Add targeted tests to lock in the selection behavior (especially the prefix_lengths>0 rejection) to prevent factory regressions.
AI Code Review - PR #1110Status: BLOCKING Summary: P0/0 · P1/4 · P2/22 · P3/0 Blocking IssuesP1
Non-blocking SuggestionsP2
Checklist ✅ (56 items passed)Strengths
|
AI Code Review - PR #1110Status: BLOCKING Summary: P0/0 · P1/3 · P2/15 · P3/0 Blocking IssuesP1
Non-blocking SuggestionsP2
Checklist ✅ (56 items passed)Strengths
|
7c0dc86 to
33aa994
Compare
| if _is_xpu_device(): | ||
| os.environ["ZE_AFFINITY_MASK"] = ",".join(cuda_device_list) | ||
| else: | ||
| os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(cuda_device_list) |
| RTP_LLM_CHECK_WITH_INFO(params.src_ptrs.size() == params.copy_size.size() | ||
| && params.src_ptrs.size() == params.dst_offsets.size(), | ||
| "multiMergeCopy: src_ptrs/copy_size/dst_offsets length mismatch"); | ||
| sycl::queue& queue = c10::xpu::getCurrentXPUStream(); |
| #elif USING_XPU | ||
| // XPU fallback: sequential async memcpy via SYCL queue | ||
| RTP_LLM_CHECK(params.num_copies >= 0 && params.num_copies <= MAX_FUSED_D2D_COPIES); | ||
| sycl::queue& queue = c10::xpu::getCurrentXPUStream(); | ||
| for (int i = 0; i < params.num_copies; ++i) { |
| #elif USING_XPU | ||
| // XPU: async strided memcpy via SYCL queue. | ||
| // When rows are contiguous (stride == row_bytes), merge into a single memcpy | ||
| // to reduce queue submission overhead. | ||
| RTP_LLM_CHECK(params.num_copies >= 0 && params.num_copies <= MAX_FUSED_STRIDED_COPIES); | ||
| sycl::queue& queue = c10::xpu::getCurrentXPUStream(); | ||
| for (int i = 0; i < params.num_copies; ++i) { |
| import torch | ||
|
|
||
| from rtp_llm.device.device_type import DeviceType, get_device_type, is_cuda, is_hip | ||
| from rtp_llm.device.device_type import DeviceType, get_device_type, is_cuda, is_hip, is_xpu |
AI Code Review - PR #1110Status: BLOCKING Summary: P0/0 · P1/1 · P2/31 · P3/13 Blocking IssuesP1
Non-blocking SuggestionsP2
P3
Checklist Violations (2 fail / 56 total)General Principles Checklist
Python Static-First Checklist
Strengths
|
33aa994 to
1f02b91
Compare
AI Code Review - PR #1110Status: BLOCKING Summary: P0/1 · P1/3 · P2/48 · P3/16 Blocking IssuesP0
P1
Non-blocking SuggestionsP2
P3
Checklist ✅ (56 items passed)Strengths
|
3f17c9e to
bfc3261
Compare
| bool has_any_generator = std::any_of( | ||
| params.generator.begin(), params.generator.end(), | ||
| [](const c10::optional<at::Generator>& g) { return g.has_value() && g->defined(); }); |
| // XPU fallback: sequential async memcpy via SYCL queue | ||
| RTP_LLM_CHECK(params.num_copies >= 0 && params.num_copies <= MAX_FUSED_D2D_COPIES); | ||
| sycl::queue& queue = c10::xpu::getCurrentXPUStream(); | ||
| for (int i = 0; i < params.num_copies; ++i) { |
| RTP_LLM_CHECK(params.num_copies >= 0 && params.num_copies <= MAX_FUSED_STRIDED_COPIES); | ||
| sycl::queue& queue = c10::xpu::getCurrentXPUStream(); | ||
| for (int i = 0; i < params.num_copies; ++i) { | ||
| RTP_LLM_CHECK(params.dst[i] != nullptr && params.src[i] != nullptr); |
| void multiMergeCopy(const MultiMergeCopyParams& params) { | ||
| RTP_LLM_CHECK_WITH_INFO(params.dst_ptr != nullptr, "multiMergeCopy: dst_ptr is null"); | ||
| RTP_LLM_CHECK_WITH_INFO(params.src_ptrs.size() == params.copy_size.size() | ||
| && params.src_ptrs.size() == params.dst_offsets.size(), | ||
| "multiMergeCopy: src_ptrs/copy_size/dst_offsets length mismatch"); | ||
| sycl::queue& queue = c10::xpu::getCurrentXPUStream(); | ||
| for (size_t i = 0; i < params.src_ptrs.size(); i++) { |
| # XPU lockfile was generated with Python 3.12 (PyTorch XPU requires ==3.12). | ||
| # pip_parse only declares the hub repo and parses the hashed lockfile; it | ||
| # does not download wheels, so declaring it in every container is cheap. | ||
| # The actual whl_library fetches DO run the interpreter and would fail on a | ||
| # Python 3.10 container (e.g. scikit-learn==1.8.0 is an XPU-only transitive | ||
| # pin that Requires-Python>=3.11). Those fetches are gated by xpu_pip_gate | ||
| # below on TF_NEED_XPU, so `bazel sync` / non-XPU builds never resolve the | ||
| # XPU wheels. | ||
| pip_parse( | ||
| name = "pip_xpu_torch", | ||
| requirements_lock = "@rtp_deps//:requirements_lock_xpu.txt", | ||
| python_interpreter = "/opt/conda310/bin/python3", | ||
| extra_pip_args = PIP_EXTRA_ARGS + ["--extra-index-url=https://download.pytorch.org/whl/xpu"], | ||
| timeout = 3600, | ||
| ) |
AI Code Review - PR #1110Status: LGTM Summary: P0/0 · P1/0 · P2/37 · P3/12 lgtm ready to ci Non-blocking SuggestionsP2
P3
Checklist ✅ (56 items passed)Strengths
|
bfc3261 to
ace9aa5
Compare
AI Code Review - PR #1110Status: BLOCKING Summary: P0/0 · P1/6 · P2/35 · P3/16 Blocking IssuesP1
Non-blocking SuggestionsP2
P3
Checklist ✅ (56 items passed)Strengths
|
- ModelTypes.cc / MtpBatchStreamProcessor.cc / MtpExecutor.cc / KVCacheManager.cc: wrap all .pin_memory() call sites (5 + 9 + 1 + 1) so XPU TP/speculative-decode paths no longer call cudaHostAlloc and crash. - vllm_flash_attn.py batched prefill: replace silent integer-division truncation with divmod + RuntimeError when block_ids is not evenly divisible by num_reqs, preventing silent KV cache misalignment. - loader.py _load_from_fastsafetensor: replace two raw torch.cuda.empty_cache() calls (silently skipped on XPU) with ModelLoader.force_clean_cuda_memory() which already handles XPU via torch.xpu.empty_cache(), avoiding OOM on large model loads.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 92 out of 96 changed files in this pull request and generated no new comments.
Comments suppressed due to low confidence (10)
rtp_llm/models/base_model.py:1
get_device_string()can return "cpu", which would produce device strings like "cpu:1" whenlocal_rank != 0. That is not a valid/meaningful CPU device and may break CPU fallback paths. Consider returning just "cpu" for CPU (no index), and only appending:local_rankfor real multi-device backends (cuda/xpu).
rtp_llm/models/base_model.py:1get_device_string()can return "cpu", which would produce device strings like "cpu:1" whenlocal_rank != 0. That is not a valid/meaningful CPU device and may break CPU fallback paths. Consider returning just "cpu" for CPU (no index), and only appending:local_rankfor real multi-device backends (cuda/xpu).
rtp_llm/models_py/bindings/core/CudaSampleOp.cc:1- Reinterpreting
top_kfrom int32 -> uint32 breaks the common "disabled" semantics for values <= 0 (e.g., -1 becomes a huge uint32). This will incorrectly turn on top-k filtering and also breaks thet <= 0checks. Keeptop_kasint32_t*throughout and perform<= 0logic on signed values.
rtp_llm/models_py/bindings/core/CudaSampleOp.cc:1 - Reinterpreting
top_kfrom int32 -> uint32 breaks the common "disabled" semantics for values <= 0 (e.g., -1 becomes a huge uint32). This will incorrectly turn on top-k filtering and also breaks thet <= 0checks. Keeptop_kasint32_t*throughout and perform<= 0logic on signed values.
rtp_llm/models_py/bindings/core/CudaSampleOp.cc:1 - Reinterpreting
top_kfrom int32 -> uint32 breaks the common "disabled" semantics for values <= 0 (e.g., -1 becomes a huge uint32). This will incorrectly turn on top-k filtering and also breaks thet <= 0checks. Keeptop_kasint32_t*throughout and perform<= 0logic on signed values.
rtp_llm/models_py/bindings/core/CudaSampleOp.cc:1 - The degenerate-row fallback is intended to use argmax of the original logits, but
params.logitshas already been overwritten with softmax probabilities (params.logits.copy_(probs_t)). This makes the fallback behave unexpectedly for NaN/Inf rows (and generally deviates from the stated intent). Preserve a copy of the pre-softmax logits (or computefallbackbefore overwriting) and use that for the argmax fallback.
rtp_llm/models_py/bindings/core/CudaSampleOp.cc:1 - The degenerate-row fallback is intended to use argmax of the original logits, but
params.logitshas already been overwritten with softmax probabilities (params.logits.copy_(probs_t)). This makes the fallback behave unexpectedly for NaN/Inf rows (and generally deviates from the stated intent). Preserve a copy of the pre-softmax logits (or computefallbackbefore overwriting) and use that for the argmax fallback.
rtp_llm/models_py/modules/base/xpu/vllm_xpu_ops.py:1 - This uses a Python
assertfor input validation. Asserts can be disabled with Python optimizations (-O), which would skip the check and lead to harder-to-debug failures later. Prefer raisingValueError/RuntimeErrorwith the same message.
rtp_llm/start_backend_server.py:1 - The raised
Exceptionmessage is grammatically unclear and includes a backslash line continuation inside the f-string, which makes the output hard to read. Consider raising a more specific exception type (e.g.,ValueError) with a single-line message like: "WORLD_SIZE {world_size} must be a multiple of local device count {device_count}".
rtp_llm/models_py/utils/arch.py:1 is_xpuis imported but not used in this module (based on the shown diff). Removing it would avoid unused-import lint failures and reduce confusion about supported backends for SM queries.
AI Code Review - PR #1110Status: BLOCKING Summary: P0/0 · P1/1 · P2/30 · P3/14 Blocking IssuesP1
Non-blocking SuggestionsP2
P3
Checklist ✅ (56 items passed)Strengths
|
…ss log Replace hardcoded torch.cuda.is_available/memory_allocated/memory_reserved with gpu_is_available() and a device-aware branch so XPU also emits the fastsafetensor loading progress log with correct memory figures. Fixes: weight loader progress log hard-codes torch.cuda, XPU reports no memory info (loader.py:420).
AI Code Review - PR #1110Status: BLOCKING Summary: P0/0 · P1/2 · P2/37 · P3/16 Blocking IssuesP1
Non-blocking SuggestionsP2
P3
Checklist ✅ (56 items passed)Strengths
|
…ty ZE_AFFINITY_MASK
1. Decode hot path: compute _seq_fp and _table_hash only at step start
(layer_idx==0) and reuse cls._step_seq_fp / cls._step_table_hash for
subsequent layers. Also replace kv_lens hash with _seq_fp since
kv_lens = seq_lens + 1 is deterministic. Eliminates ~108 redundant
numpy().tobytes() calls per decode step on a 36-layer model.
2. ZE_AFFINITY_MASK: guard against empty string ("") which causes
int("") ValueError crash. Use 'xpu_mask.strip()' check consistent
with CUDA_VISIBLE_DEVICES handling.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 92 out of 96 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (5)
rtp_llm/models_py/bindings/core/CudaSampleOp.cc:1
params.generatoris iterated asc10::optional<at::Generator>, but later accessed as if it were a plainat::Generator(params.generator[b].defined()). This is a compile-time error and also risks passing an unset generator totorch::multinomial. Fix by checkinghas_value()(and->defined()if needed) and passing either*params.generator[b]/params.generator[b].value()(or the optional itself if the multinomial overload expects an optional) consistently.
rtp_llm/models_py/bindings/core/CudaOps.cc:1c10::xpu::getCurrentXPUStream()is used as if it returnssycl::queue&, but in PyTorch it is typically an XPU stream wrapper type (with a.queue()accessor) rather than asycl::queueitself. This is likely a compile error and/or ABI mismatch. Update the code to obtain the queue from the stream wrapper (and include the appropriate SYCL headers only if required by the returned type).
rtp_llm/models_py/bindings/common/FusedCopyOp.cc:1- Same issue as in
CudaOps.cc: treatingc10::xpu::getCurrentXPUStream()as asycl::queue&is likely incorrect and may not compile depending on the PyTorch XPU API. Retrieve the underlying SYCL queue from the XPU stream object (or use a PyTorch-provided copy primitive) and ensure the correct headers/types are used.
rtp_llm/device/device_impl.py:1 ZE_AFFINITY_MASK/CUDA_VISIBLE_DEVICESentries may contain whitespace (e.g.'0, 1') or empty tokens (trailing comma). Since other code converts these strings toint(e.g., XPU device-id derivation), this can raiseValueErrorat runtime. Strip each entry and filter out empty strings before returning the list.
rtp_llm/start_backend_server.py:1- The helper is now GPU-agnostic (CUDA/ROCm/XPU) but is still named
_get_cuda_device_list(). Rename it (and any related variables likecuda_device_list) to reflect the generalized behavior (e.g.,_get_gpu_device_list) to avoid confusion when debugging XPU masking (ZE_AFFINITY_MASK) vs CUDA (CUDA_VISIBLE_DEVICES).
| elif normalized in _XPU_PACKAGE_REMAP: | ||
| xpu_reqs.append(_XPU_PACKAGE_REMAP[normalized]) | ||
| else: | ||
| xpu_reqs.append(req) |
AI Code Review - PR #1110Status: LGTM Summary: P0/0 · P1/0 · P2/30 · P3/14 lgtm ready to ci Non-blocking SuggestionsP2
P3
Checklist ✅ (56 items passed)Strengths
|
|
@LLLLKKKK 能提供一下 CI build-ppu FAILED的具体错误信息吗: {"jobId":"72417425","jobName":"build-ppu","rawMeta":"{}","status":"FAILED"}? |
The pip_xpu_torch lockfile embeds --extra-index-url pointing to download.pytorch.org/whl/xpu. When arch_select.bzl loaded directly from @pip_xpu_torch, it triggered the pip_parse repo rule on ALL builds (including PPU). On internal CI machines that cannot reach download.pytorch.org, this caused an immediate (~40s) failure. Fix: route the requirement() function through @xpu_pip_gate (which already gates install_deps). On non-XPU builds, the gate returns a dummy label pointing to a local py_library target, so @pip_xpu_torch is never accessed and its repo rule never executes.
Overview
Add Intel GPU (XPU) inference support to RTP-LLM, reusing vllm-xpu-kernels to optimize performance on Intel GPU.
The base environment is the intel/vllm Docker image.
Guiding principles:
--config=xpu— DO NOT break existing code logicChanges
1. Build Infrastructure
xpu_configure.bzl(analogous tocuda_configure).bazelrc --config=xpupreset with oneAPI compiler flags2. C++ Device Generalization
select()branches in BUILD filesxpu_sycl_compilefeature flag3. Python Device & Attention
4. Module Factories & Server Integration
Test Environment
How to Build