Open
52 commits
c9b9bf1
build: migrate OSS packaging to pyproject and setup.py
LLLLKKKK May 13, 2026
2f444b7
fix(import): make python-native platform bootstrap safe
LLLLKKKK May 13, 2026
211b9bd
test: migrate OSS unit tests to pytest profiles
LLLLKKKK May 13, 2026
45175bd
test(smoke): port smoke and perf suites to pytest
LLLLKKKK May 13, 2026
26bf22c
feat(remote): add REAPI pytest execution
LLLLKKKK May 13, 2026
cbeb615
ci: run perf profile on remote h20
LLLLKKKK May 13, 2026
692bf18
fix(package): address packaging review issues
LLLLKKKK May 13, 2026
5ce731e
ci: run gb200 eval bench via pytest
LLLLKKKK May 13, 2026
34994a6
fix(smoke): register tau2 bench comparer
LLLLKKKK May 13, 2026
4c6906f
fix(ci): route cpp utility tests to A10
LLLLKKKK May 13, 2026
dce3a2f
fix(ci): pin tokenizers for transformer deps
LLLLKKKK May 13, 2026
705effc
fix(smoke): use bundled tau2 data in ci
LLLLKKKK May 13, 2026
82cb500
build: precompile bytecode during package installs
LLLLKKKK May 14, 2026
f4f43be
feat: merge main into feature/python_native_v2
baohengyi Jun 18, 2026
c591430
fix: address P0/P1 review findings
baohengyi Jun 18, 2026
75656a0
fix: address P1 C++ review findings
baohengyi Jun 18, 2026
7b6e403
fix: address remaining P0/P1 review findings
baohengyi Jun 18, 2026
a468563
fix: address P1/P2 PR review findings
baohengyi Jun 18, 2026
16591ed
Merge branch 'upstream/main' into feature/python_native_v2
baohengyi Jun 21, 2026
a073b14
fix: restore py_standalone_testlib BUILD target
baohengyi Jun 21, 2026
2a93558
docs: migrate README_cn install instructions to pyproject extras
baohengyi Jun 21, 2026
c437f25
fix: address P0/P1 PR review findings
baohengyi Jun 22, 2026
49204c8
fix: address P1/P2 PR review findings
baohengyi Jun 22, 2026
53e06d2
fix(pr-review): address P0/P1 issues from review round
baohengyi Jun 22, 2026
ce1102a
fix(blocking): address P0/P1 issues from PR review
baohengyi Jun 22, 2026
936353f
fix(p2): address non-blocking suggestions from PR review
baohengyi Jun 22, 2026
7b0c35c
fix(pr-review): address P1/P2 issues from review round 3
baohengyi Jun 22, 2026
56ee946
fix: resolve PR review P1 blockers and P2 suggestions
baohengyi Jun 23, 2026
7dbfb9f
fix(build): address infra/config review P2 issues
baohengyi Jun 23, 2026
387a86f
fix: resolve PR review blocking P0/P1 issues
baohengyi Jun 23, 2026
674ea87
fix(p2): address non-blocking P2 suggestions from PR review round 7
baohengyi Jun 23, 2026
1e1ed6f
fix: address P0/P1/P2 issues from PR review round 8
baohengyi Jun 24, 2026
df79502
fix: address P0/P1 blocking issues from PR review round 9
baohengyi Jun 24, 2026
3cbdeee
fix(p2): address P2/P3 non-blocking suggestions from review round 9
baohengyi Jun 24, 2026
c2491d4
fix: address review round 10 issues
baohengyi Jun 24, 2026
f4ebbb9
Merge remote-tracking branch 'upstream/main' into feature/python_nati…
baohengyi Jun 26, 2026
4fe7d38
fix(perf): propagate per-host errors in multi_runner.sh subshells
baohengyi Jun 26, 2026
52209da
refactor(perf): use set -e uniformly in multi_runner.sh per-host subs…
baohengyi Jun 27, 2026
44cd636
fix(build): exclude flydsl_*.py from :fla srcs to unbreak ROCm analysis
baohengyi Jun 27, 2026
b5c14cc
fix(review): address review batch (correctness, perf, safety)
baohengyi Jun 28, 2026
3ae8f15
fix(review): pickle back-compat, XQA fixes, test correctness & cleanup
baohengyi Jun 29, 2026
c018752
Merge remote-tracking branch 'upstream/main' into feature/python_nati…
baohengyi Jun 30, 2026
bdce122
fix(smoke): narrow mainse comparer predicates to per-mode sub-flags
baohengyi Jun 30, 2026
c825a37
fix(review): follow-up polish from review round (correctness & cleanup)
baohengyi Jun 30, 2026
386ebfe
Merge remote-tracking branch 'upstream/main' into feature/python_nati…
baohengyi Jul 1, 2026
e70f6a2
fix: restore XQA backend NAME and add torch auto-detect fallback
baohengyi Jul 1, 2026
0b1d0be
refactor(config): strict pickle field-count check for FMHAConfig/MoeC…
baohengyi Jul 1, 2026
ad0304a
fix: remove dead FMHAType import from xqa.py
baohengyi Jul 1, 2026
5c7ec94
fix(flashinfer): assert replay batch_size <= captured in cuda graph r…
baohengyi Jul 1, 2026
865014c
Merge branch 'main' into feature/python_native_v2
baohengyi Jul 2, 2026
49863fd
fix: scheduler fairness, bazelrc test LD_LIBRARY_PATH, ops .so fallback
baohengyi Jul 3, 2026
521f14e
fix(review): dead flag, aiter-decode gating, dataclass fields, dup in…
baohengyi Jul 3, 2026
3 changes: 2 additions & 1 deletion .bazelignore
@@ -1,2 +1,3 @@
internal_source/rdma
open_source
open_source
bazel-github-opensource
52 changes: 37 additions & 15 deletions .bazelrc
@@ -2,8 +2,6 @@ common --experimental_downloader_config=bazel/bazel_downloader.cfg
fetch:downloader --experimental_remote_downloader=grpc://com.taobao.search.bazel.server:9092 --remote_cache=grpc://com.taobao.search.bazel.server:9092


build --python_top=//:python310 --incompatible_use_python_toolchains=false # force use /opt/conda310/bin/python3

build --spawn_strategy=local # avoid nvcc conflicts
build --action_env PYTHON_BIN_PATH="/opt/conda310/bin/python3"
build --cxxopt="-std=c++17" --copt="-DGTEST_USE_OWN_TR1_TUPLE=0" --copt="-DEIGEN_MAX_CPP_VER=11"
@@ -30,8 +28,8 @@ build --define=xft_use_icx=true
build --define=frontend=false
build --linkopt -ldl # exception backtrace
build --define=NVSHMEM_DIR=/
build --action_env PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/tmp/cache_bust_20260321"
build --host_action_env PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/tmp/cache_bust_20260321"
build --action_env PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/tmp/cache_bust_20260423"
build --host_action_env PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/tmp/cache_bust_20260423"

build:cuda --copt="-DENABLE_BF16=1"
build:cuda --copt="-DBUILD_CUTLASS_MIXED_GEMM=ON"
@@ -56,6 +54,9 @@ build:cuda --action_env TF_NCCL_VERSION="2"
build:cuda --action_env CUDNN_INSTALL_PATH="/usr/local/cuda/"

# 6.0 = P100, 7.0 = V100, 7.5 = T4, 8.6 = A10, 8.0 = A100 8.9 = L, 9.0 = H800
# NOTE: cuda12 is a shared base config. Do not use it directly; it lacks the
# platform-specific torch_deps/torch_libs selection. Use cuda12_6, cuda12_9, or
# cuda12_9_arm instead.
build:cuda12 --config=cuda
build:cuda12 --action_env TF_CUDA_COMPUTE_CAPABILITIES="7.0,7.5,8.0,8.6,8.9,9.0"
build:cuda12 --host_action_env TF_CUDA_COMPUTE_CAPABILITIES="7.0,7.5,8.0,8.6,8.9,9.0"
@@ -168,14 +169,14 @@ build:rocm --copt="-D_GLIBCXX_USE_CXX11_ABI=1"
#build:rocm --action_env HIPBLASLT_LOG_MASK="32" # to log blasLt
#build:rocm --action_env ROCBLAS_LAYER="4"
#build:rocm --copt="-DENABLE_PROF=1"
build:rocm --action_env LD_LIBRARY_PATH="/opt/rh/gcc-toolset-12/root/usr/lib64:/lib64:/opt/conda310/lib/:/opt/rocm/lib/:/opt/taobao/java/jre/lib/amd64/server/:/opt/amdgpu/lib64/"
build:rocm --host_action_env LD_LIBRARY_PATH="/opt/rocm/lib/:/opt/taobao/java/jre/lib/amd64/server/:/opt/amdgpu/lib64/"
build:rocm --action_env PATH="/opt/rh/gcc-toolset-12/root/usr/bin:/opt/conda310/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/tmp/cache_bust_20260321"
build:rocm --host_action_env PATH="/opt/rh/gcc-toolset-12/root/usr/bin:/opt/conda310/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/tmp/cache_bust_20260321"
build:rocm --action_env LD_LIBRARY_PATH="/opt/rh/gcc-toolset-12/root/usr/lib64:/lib64:/opt/rocm/lib/:/opt/amdgpu/lib64/"
build:rocm --host_action_env LD_LIBRARY_PATH="/opt/rh/gcc-toolset-12/root/usr/lib64:/lib64:/opt/rocm/lib/:/opt/amdgpu/lib64/"
build:rocm --action_env PATH="/opt/rh/gcc-toolset-12/root/usr/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/tmp/cache_bust_20260423"
build:rocm --host_action_env PATH="/opt/rh/gcc-toolset-12/root/usr/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/tmp/cache_bust_20260423"
build:rocm --copt="-DUSE_ROCM=1"
build:rocm --copt="-DFORCE_NVSHMEM_API=1"
build:rocm --copt="-DFLASHINFER_CUB_SUBTRACTLEFT_DEFINED" # ROCm 7.2 hipcub uses SubtractLeft instead of FlagHeads
test:rocm --test_env="/opt/rocm/lib/:/opt/taobao/java/jre/lib/amd64/server/:/opt/amdgpu/lib64/"


build:arm --copt="-DENABLE_BF16=1"
build:arm --define=using_cuda=false --define=using_cuda_nvcc=false
@@ -213,20 +214,41 @@ build:asan --copt -fPIC # for "fix relocation truncated to fit: R_X86_64_32 agai
build:asan --copt -fdebug-types-section # for "fix relocation truncated to fit: R_X86_64_32 against `.debug_info'" collect2 error
build:asan --linkopt -fsanitize=address

test:rocm --test_env PATH="/opt/rocm/bin:/opt/rh/gcc-toolset-12/root/usr/bin:/opt/conda310/bin:/opt/conda310/condabin:/usr/share/Modules/bin:/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/X11R6/bin:/opt/cmake/cmake-3.26.4/bin"
test:rocm --test_env LD_LIBRARY_PATH="/opt/rh/gcc-toolset-12/root/usr/lib64:/opt/rocm/lib:/opt/conda310/lib/:/usr/lib64:/opt/amdgpu/lib64"
test --test_env LD_LIBRARY_PATH="/opt/rocm/lib:/opt/conda310/lib/:/usr/local/nvidia/lib64:/usr/lib64:/usr/local/cuda/lib64:/opt/amdgpu/lib64:/usr/local/cuda/extras/CUPTI/lib64"
test --test_env OMP_NUM_THREADS=8
test --test_env FT_SERVER_TEST="1"
test --test_env LOG_LEVEL="INFO"
test --test_env NCCL_DEBUG="INFO"
# for dp scatter add stable result
test --test_env TZ=Asia/Shanghai
# Global minimal test LD_LIBRARY_PATH so platforms without a dedicated stanza
# (cpu / arm / ppu) can still find the conda310 python/torch runtime libs.
# The cuda/rocm/compat stanzas below set fuller paths and override this for
# their --config (single-valued --test_env: last applied wins).
test --test_env LD_LIBRARY_PATH="/opt/conda310/lib/:/usr/lib64:/lib64"
# for flashinfer find cuda; scope to CUDA-specific test runs
test:cuda --test_env CUDA_HOME=/usr/local/cuda

# NOTE: /opt/conda310/bin MUST stay on PATH — many cpp tests use
# `#!/usr/bin/env python3` shebangs / subprocess invocations and conda310
# is the only python3 in the dev images. Dropping it caused
# `/usr/bin/env: 'python3': No such file or directory` to fail all 54 cpp-ut
# tests on REAPI workers (PR 965 run 39271211 cpp-ut job).
test:cuda --test_env PATH="/opt/conda310/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
test:cuda --test_env LD_LIBRARY_PATH="/opt/conda310/lib/:/usr/local/nvidia/lib64:/usr/lib64:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64"

# ROCm: gcc-toolset-12 lib64 MUST come first so dlopen of aiter JIT-built
# .so finds GCC 12 libstdc++ (has __cxx11::ostringstream::str() const).
# conda310 lib AFTER gcc-toolset-12 to avoid clobbering libstdc++ resolution
# but BEFORE rocm/amdgpu so python finds its own runtime.
test:rocm --test_env LD_LIBRARY_PATH="/opt/rh/gcc-toolset-12/root/usr/lib64:/opt/conda310/lib/:/lib64:/opt/rocm/lib/:/opt/amdgpu/lib64/"
test:rocm --test_env PATH="/opt/rh/gcc-toolset-12/root/usr/bin:/opt/conda310/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/tmp/cache_bust_20260423"

# compat for low driver < 535
test:compat --test_env LD_LIBRARY_PATH="/opt/rocm/lib:/opt/conda310/lib/:/usr/local/cuda/compat/:/usr/local/nvidia/lib64:/usr/lib64:/usr/local/cuda/lib64:/opt/amdgpu/lib64:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64/stubs"

# internal source overrides (only loaded when internal_source symlink exists)
try-import %workspace%/internal_source/.internal_bazelrc
try-import %workspace%/internal_source/.cicd_bazelrc
# internal source overrides (only loaded in the internal monorepo checkout)
try-import %workspace%/../internal_source/.internal_bazelrc
# torch path from host environment (generated by setup.py)
try-import %workspace%/.torch_bazelrc
# always use for aone ci
try-import %workspace%/../internal_source/.cicd_bazelrc
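The comments in this `.bazelrc` hunk depend on two ordering rules. First, a `test:<config>` `--test_env` value overrides the plain `test` value for the same variable. Second, the dynamic loader takes the first `LD_LIBRARY_PATH` directory that contains a library.

The Python sketch below models both rules under stated assumptions. It is not Bazel's or ld.so's implementation:

- The `(prefix, "NAME=VALUE")` encoding, the function names `resolve_test_env` / `find_library`, and the directory contents in the demo are all hypothetical.
- The loader model ignores RPATH/RUNPATH and `ld.so.cache`.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical encoding of the rc lines: (rc prefix, "NAME=VALUE"), in file order.
RcLine = Tuple[str, str]


def resolve_test_env(rc_lines: List[RcLine], configs: List[str]) -> Dict[str, str]:
    """Simplified precedence model (an assumption, not Bazel source):
    plain `test` lines apply first, then the `test:<config>` lines of each
    --config in command-line order; the last value for a name wins."""
    env: Dict[str, str] = {}

    def apply(prefix: str) -> None:
        for rc_prefix, assignment in rc_lines:
            if rc_prefix == prefix:
                name, _, value = assignment.partition("=")
                env[name] = value  # a later assignment replaces an earlier one

    apply("test")
    for config in configs:
        apply(f"test:{config}")
    return env


def find_library(ld_library_path: str, soname: str,
                 contents: Dict[str, Set[str]]) -> Optional[str]:
    """Return the first LD_LIBRARY_PATH directory listing `soname`.
    Only models the left-to-right search; RPATH/RUNPATH and ld.so.cache are ignored."""
    for directory in ld_library_path.split(":"):
        if directory and soname in contents.get(directory, set()):
            return directory
    return None


if __name__ == "__main__":
    rc = [
        ("test", "LD_LIBRARY_PATH=/opt/conda310/lib/:/usr/lib64:/lib64"),
        ("test:rocm", "LD_LIBRARY_PATH=/opt/rh/gcc-toolset-12/root/usr/lib64:"
                      "/opt/conda310/lib/:/lib64:/opt/rocm/lib/:/opt/amdgpu/lib64/"),
    ]
    # Hypothetical directory contents: both directories ship a libstdc++.so.6.
    contents = {
        "/opt/rh/gcc-toolset-12/root/usr/lib64": {"libstdc++.so.6"},
        "/opt/conda310/lib/": {"libstdc++.so.6"},
    }
    for configs in ([], ["rocm"]):
        path = resolve_test_env(rc, configs)["LD_LIBRARY_PATH"]
        print(configs, find_library(path, "libstdc++.so.6", contents))
```

Under this model, a run without a config resolves `libstdc++.so.6` from `/opt/conda310/lib/`. With `--config=rocm` it resolves from `/opt/rh/gcc-toolset-12/root/usr/lib64`, which is the ordering the ROCm comment asks for.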
22 changes: 20 additions & 2 deletions .gitignore
@@ -1,12 +1,26 @@
*~
*.o
__pycache__/

# Local venv: use $HOME/venvs/<name> (see scripts/create_venv.sh). .venv/ is uv/IDE default when
# running `uv run`/`uv sync` in-tree without UV_PROJECT_ENVIRONMENT — ignore so it never gets committed.
.venv/
venv/
.vscode
.idea
./translation
.cache
*.npy
*.pth
.torch_bazelrc
*.egg-info/

# Build artifacts - entire libs/ dir is generated by setup.py build_ext
rtp_llm/libs/
# Generated .pyi stubs (created by setup.py build_ext via pybind11_stubgen)
rtp_llm/ops/**/*.pyi
smoke_actual/
main_logs/
!tests/data/**/*.npy
/notebooks
**/.ipynb_checkpoints/
@@ -42,8 +56,11 @@ package/rtp_llm*.whl
package/maga-ci.Dockerfile
rtp_llm/test-data/*

model_rpc_service_pb2.py
model_rpc_service_pb2_grpc.py
# Protobuf / gRPC generated Python (grpc_tools.protoc / setup.py build_ext)
rtp_llm/cpp/model_rpc/proto/*_pb2.py
rtp_llm/cpp/model_rpc/proto/*_pb2_grpc.py
rtp_llm/test/remote_tests/*_pb2.py
rtp_llm/test/remote_tests/*_pb2_grpc.py
rtp_llm/cpp/deep_gemm/cutlass_hdr

cuda_profiler*.json
@@ -69,6 +86,7 @@ git.diff

*.mo
*build*/
!_build/
.doctrees/
node_modules/
docs/index/dist/
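Two rules in this `.gitignore` rely on negation. `!tests/data/**/*.npy` re-includes test fixtures after `*.npy` ignores them, and `!_build/` re-includes `_build/` after `*build*/` ignores it.

The Python sketch below is a simplified model of gitignore matching that shows why both rules work. It assumes two things: the last matching rule wins, and a file inside an excluded directory cannot be re-included. It approximates `**` with `fnmatch`, whose `*` also matches `/`. Treat it as an aid for reasoning about these patterns, not a replacement for `git check-ignore`. The function names are hypothetical.

```python
import fnmatch
from typing import List, Tuple

Rule = Tuple[bool, str, bool]  # (negated, pattern, directory-only)


def parse_rules(lines: List[str]) -> List[Rule]:
    rules: List[Rule] = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        negated = line.startswith("!")
        if negated:
            line = line[1:]
        dir_only = line.endswith("/")
        rules.append((negated, line.rstrip("/"), dir_only))
    return rules


def _matches(path: str, is_dir: bool, pattern: str, dir_only: bool) -> bool:
    if dir_only and not is_dir:
        return False  # "foo/" only applies to directories
    if "/" not in pattern:
        # A slash-free pattern matches the last path component at any depth.
        return fnmatch.fnmatchcase(path.rsplit("/", 1)[-1], pattern)
    # A pattern containing "/" is anchored at the .gitignore's directory.
    anchored = pattern.lstrip("/")
    # Approximate "/**/" as "zero or more directories".
    candidates = {anchored, anchored.replace("/**/", "/")}
    return any(fnmatch.fnmatchcase(path, c) for c in candidates)


def _last_match(path: str, is_dir: bool, rules: List[Rule]) -> bool:
    ignored = False
    for negated, pattern, dir_only in rules:
        if _matches(path, is_dir, pattern, dir_only):
            ignored = not negated  # the last matching rule wins
    return ignored


def is_ignored(path: str, is_dir: bool, lines: List[str]) -> bool:
    rules = parse_rules(lines)
    parts = path.split("/")
    # Git does not descend into an excluded directory, so a later "!" rule
    # cannot re-include a file whose ancestor directory is excluded.
    for i in range(1, len(parts)):
        if _last_match("/".join(parts[:i]), True, rules):
            return True
    return _last_match(path, is_dir, rules)
```

For example, with the rules `["*.npy", "rtp_llm/libs/", "!tests/data/**/*.npy", "*build*/", "!_build/"]` (in their order from this file):

- `tests/data/golden/out.npy` stays tracked.
- Anything under `rtp_llm/libs/` is ignored.
- `docs/_build/` stays tracked while `cmake-build/` is ignored.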
5 changes: 2 additions & 3 deletions 3rdparty/BUILD
@@ -1,5 +1,4 @@
load("//:def.bzl", "rpm_library", "copts", "cuda_copts")
load("@arch_config//:arch_select.bzl", "torch_deps")
load("//:def.bzl", "copts", "cuda_copts")
package(default_visibility = ["//visibility:public"])

cc_library(
@@ -13,7 +12,7 @@ cc_library(
"@local_config_cuda//cuda:cuda_driver",
],
copts = copts(),
visibility = ["//visibility:public"],
visibility = ["//visibility:public"],
)

cc_library(
51 changes: 0 additions & 51 deletions 3rdparty/accl_ep/BUILD

This file was deleted.

2 changes: 1 addition & 1 deletion 3rdparty/contextFusedMultiHeadAttention/BUILD
@@ -1,6 +1,6 @@
load("//:def.bzl", "rpm_library", "copts", "cuda_copts",)
load("@local_config_cuda//cuda:build_defs.bzl", "cuda_default_copts_without_arch", "if_cuda")
load("@arch_config//:arch_select.bzl", "torch_deps")
load("@rtp_llm//bazel:defs.bzl", "torch_deps")
load("@rules_cc//examples:experimental_cc_shared_library.bzl", "cc_shared_library")
package(default_visibility = ["//visibility:public"])

2 changes: 1 addition & 1 deletion 3rdparty/contextFusedMultiHeadAttentionSm70/BUILD
@@ -1,5 +1,5 @@
load("//:def.bzl", "rpm_library", "copts", "cuda_copts",)
load("@arch_config//:arch_select.bzl", "torch_deps")
load("@rtp_llm//bazel:defs.bzl", "torch_deps")
package(default_visibility = ["//visibility:public"])

cc_library(
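Both `contextFusedMultiHeadAttention` BUILD files make the same one-line change: `load("@arch_config//:arch_select.bzl", "torch_deps")` becomes `load("@rtp_llm//bazel:defs.bzl", "torch_deps")`.

The helper below is a hypothetical sketch for finding and applying that rewrite in other BUILD files:

- It does an exact string replacement only, so loads that list extra symbols or use different spacing are left untouched.
- It defaults to a dry run that only reports matching files.
- The names `migrate_text` and `migrate_tree` are illustrative.

```python
import pathlib
from typing import List

# Exact load statements taken from the two BUILD diffs above.
OLD_LOAD = 'load("@arch_config//:arch_select.bzl", "torch_deps")'
NEW_LOAD = 'load("@rtp_llm//bazel:defs.bzl", "torch_deps")'


def migrate_text(text: str) -> str:
    """Rewrite the old torch_deps load to the new one; other text is unchanged."""
    return text.replace(OLD_LOAD, NEW_LOAD)


def migrate_tree(root: pathlib.Path, dry_run: bool = True) -> List[pathlib.Path]:
    """Return the BUILD / BUILD.bazel files under `root` that contain the old load.
    Files are only rewritten when dry_run is False."""
    candidates = list(root.rglob("BUILD")) + list(root.rglob("BUILD.bazel"))
    changed: List[pathlib.Path] = []
    for path in sorted(p for p in candidates if p.is_file()):
        original = path.read_text()
        updated = migrate_text(original)
        if updated != original:
            changed.append(path)
            if not dry_run:
                path.write_text(updated)
    return changed
```

A typical use would be `migrate_tree(pathlib.Path("3rdparty"))` to review the affected files, then `migrate_tree(pathlib.Path("3rdparty"), dry_run=False)` to apply the change.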
64 changes: 0 additions & 64 deletions 3rdparty/deep_gemm/0001-fix-smem_buffer.patch

This file was deleted.
