fix: pin cache store workers to local device by Vinkle-hzt · Pull Request #1116 · alibaba/rtp-llm

Vinkle-hzt · 2026-06-17T10:16:00Z

Summary

Fix cache store background workers to run on the correct local GPU device.

Cache store work can run from background thread pools and RPC callbacks. These threads do not always inherit the expected CUDA/HIP current device, which can lead to cache store operations using the wrong device in multi-GPU deployments. This PR passes the local rank device id into cache store components and pins worker threads before they execute device-sensitive work.

Changes

Add pinThreadToDeviceOnce() utility for CUDA/ROCm device pinning.
Propagate local_rank into cache store init params and async cache writer.
Pin cache store readiness, store, load, TCP load callback, and async writer tasks to the configured local device.
Keep device pinning no-op for invalid device ids and non-GPU builds.

Testing

Not run locally.

LLLLKKKK · 2026-06-17T10:27:45Z

AI Code Review - PR #1116

Status: LGTM

Summary: P0/0 · P1/0 · P2/2 · P3/0

lgtm ready to ci

Non-blocking Suggestions

P2

设备切换失败后仍继续执行后台拷贝 @ rtp_llm/cpp/utils/DevicePin.h:26
- 建议：将 pin 失败改为 fail-fast，或让 helper 返回 bool/Status 并让调用方终止该后台任务并返回明确错误。
缺少非默认 GPU pin 的回归覆盖 @ rtp_llm/models_py/bindings/core/CacheStoreAsyncWriter.cc:46
- 建议：补充 CUDA/ROCm gated 或可 mock 的单测，构造非默认 device_id，并在后台任务内断言 current device 正确；必要时覆盖 Tcp load 的 GPU copy 路径。

Checklist Violations (6 fail / 34 total)

General Principles Checklist

[6.1] Architecture — 状态不变量：创建/更新/失败/重试/回滚路径有效 → issue 设备切换失败后仍继续执行后台拷贝
设备 pin 失败后 helper 直接返回，调用方状态仍视为已 pin 并继续执行后台拷贝。
[6.1] Architecture — 错误语义：fail-fast/retry/fallback/silent 行为显式 → issue 设备切换失败后仍继续执行后台拷贝
cudaSetDevice/hipSetDevice 失败是配置或设备状态错误，当前仅 WARNING 后静默 fallback 到线程原有 device。
[6.1] Tests — 新逻辑有聚焦单测 + 相关集成/smoke 测试 → issue 缺少非默认 GPU pin 的回归覆盖
diff_paths 没有测试变更，现有 async writer 测试只走 device_id=-1 默认路径。
[6.1] Tests — 分布式/跨平台变更有对应覆盖 → issue 缺少非默认 GPU pin 的回归覆盖
改动同时涉及 CUDA/ROCm 和多卡 local_rank 语义，但没有看到对应平台 gated 验证。

RTP-LLM Checklist

[F] 跨平台 — ROCm 路径错误处理非静默 → issue 设备切换失败后仍继续执行后台拷贝
hipSetDevice 失败只 WARNING 后 return，后续任务继续执行。
[H] 测试与 CI — 测试覆盖充分：大重构等价覆盖，新功能端到端测试 → issue 缺少非默认 GPU pin 的回归覆盖
新增后台线程 device pin 行为没有随 PR 增加测试覆盖。

Strengths

device_id 从 RemoteRpcServer、CacheStore/Messager 到 CacheStoreAsyncWriter 的传播链较完整，默认值 -1 也保留了旧调用路径兼容性。

LLLLKKKK · 2026-06-18T06:30:11Z

AI Code Review - PR #1116

Status: BLOCKING

Summary: P0/0 · P1/1 · P2/1 · P3/0

Blocking Issues

P1

TCP 服务端/transfer 回调仍未绑定 device @ rtp_llm/cpp/disaggregate/cache_store/TcpMessager.cpp:25
- 建议：将 device_id 继续传入 TcpCacheStoreServiceImpl/TcpTransferConnection/TcpBlockReadClosure，并在 blockReadImpl 与 TcpBlockReadClosure::Run 的 GPU copy 前调用 pinThreadToDeviceOnce。

Non-blocking Suggestions

P2

测试覆盖缺口：device pin 的有效设备路径未验证 @ rtp_llm/cpp/utils/DevicePin.h:15
- 建议：补充 CUDA/ROCm 条件测试：用有效 local_rank 创建 writer 或 cache store 任务，并在后台线程断言当前 device 正确。

Checklist Violations (3 fail / 46 total)

General Principles Checklist

[6.1] Architecture — 状态不变量：创建/更新/失败/重试/回滚路径有效 → issue TCP 服务端/transfer 回调仍未绑定 device
普通 load closure 已 pin，但 TCP 服务端和 transfer/blockRead 的 RPC 回调线程仍直接执行 GPU copy，device 状态不变量未覆盖所有异步入口。
[6.1] Tests — 新逻辑有聚焦单测 + 相关集成/smoke 测试 → issue 测试覆盖缺口：device pin 的有效设备路径未验证
新增有效 device 绑定路径未见测试覆盖，现有 CacheStoreAsyncWriterTest 默认 device_id=-1 会绕过 pinThreadToDeviceOnce。
[6.1] Tests — 分布式/跨平台变更有对应覆盖 → checklist-only
变更涉及 CUDA/ROCm 和多 local_rank 后台线程，但 diff 未增加对应覆盖；主要风险已由 device pin 遗漏和测试 issue 覆盖。

Strengths

新增 device_id 默认值为 -1，旧测试和未配置路径可以保持原行为。
新增 DevicePin 将后台线程 device 选择集中到统一工具函数，调用点复用同一实现。

LLLLKKKK · 2026-06-22T03:09:45Z

AI Code Review - PR #1116

Status: BLOCKING

Summary: P0/0 · P1/1 · P2/1 · P3/0

Blocking Issues

P1

设备 pin 异常会绕过异步 writer 的异常收集 @ rtp_llm/models_py/bindings/core/CacheStoreAsyncWriter.cc:47
- 建议：把 pinThreadToDeviceOnce 纳入 try/catch，并用 RAII/scope guard 保证 pending_count_-- 与 wait_cv_ notify 在所有退出路径执行，再通过 stored_exception_ 传播异常。

Non-blocking Suggestions

P2

缺少非 0 device 的后台 cache store 覆盖 @ rtp_llm/cpp/utils/DevicePin.h:15
- 建议：增加带 device_id=local_rank 的多 GPU guarded UT 或集成用例，验证后台 store/load/read closure 在非 0 device 上读写正确。

Checklist Violations (6 fail / 38 total)

General Principles Checklist

[6.1] Architecture — 状态不变量：创建/更新/失败/重试/回滚路径有效 → issue 设备 pin 异常会绕过异步 writer 的异常收集
CacheStoreAsyncWriter.cc:47 的 pinThreadToDeviceOnce 在 try/catch 外执行，失败时不会维护 pending_count/stored_exception_ 不变量。_
[6.1] Architecture — 错误语义：fail-fast/retry/fallback/silent 行为显式 → issue 设备 pin 异常会绕过异步 writer 的异常收集
async writer 已设计为捕获后台异常并在 waitAllDone 重抛，但新增设备 pin 失败会绕过该异常传播语义。
[6.1] Tests — 新逻辑有聚焦单测 + 相关集成/smoke 测试 → issue 缺少非 0 device 的后台 cache store 覆盖
diff 未包含测试；现有 closure/messager 相关测试仍主要走默认 device_id=-1 或 device 0，未覆盖 device pin 生效或失败路径。
[6.1] Tests — 分布式/跨平台变更有对应覆盖 → issue 缺少非 0 device 的后台 cache store 覆盖
变更包含 CUDA/ROCm 分支和多卡后台线程设备绑定，但 diff 未新增非 0 local_rank 或跨平台 guarded 覆盖。

RTP-LLM Checklist

[C] 线程安全与并发 — Worker 线程异常未传播到调用方 → issue 设备 pin 异常会绕过异步 writer 的异常收集
CacheStoreAsyncWriter.cc:47 的 pinThreadToDeviceOnce 在 try 外，异常不会进入 stored_exception，也不会减少 pending_count_。_
[H] 测试与 CI — 测试覆盖充分：大重构等价覆盖，新功能端到端测试 → issue 缺少非 0 device 的后台 cache store 覆盖
新增非 0 device 绑定路径未见端到端或 guarded UT 覆盖。

Strengths

device_id 从 RemoteRpcServer/PyWrappedModel 贯通到 cache store、TCP closure 和 async writer，主要后台线程入口都有显式 pin。

LLLLKKKK · 2026-06-22T03:51:01Z

AI Code Review - PR #1116

Status: LGTM

Summary: P0/0 · P1/0 · P2/1 · P3/0

lgtm ready to ci

Non-blocking Suggestions

P2

缺少 device pinning 的有效覆盖 @ rtp_llm/cpp/utils/DevicePin.h:25
- 建议：补充可在 GPU 环境运行的 device_id>=0 测试，或抽象 set-device 钩子做单测覆盖 CUDA/ROCm 分支与异常路径。

Checklist Violations (3 fail / 82 total)

General Principles Checklist

[6.1] Tests — 新逻辑有聚焦单测 + 相关集成/smoke 测试 → issue 缺少 device pinning 的有效覆盖
新增 CUDA/ROCm device pinning，但测试没有覆盖 device_id>=0 的后台线程行为。
[6.1] Tests — 分布式/跨平台变更有对应覆盖 → issue 缺少 device pinning 的有效覆盖
DevicePin 新增 CUDA/ROCm 分支，但未见覆盖 device_id>=0 的 CUDA/ROCm 行为测试。

RTP-LLM Checklist

[H] 测试与 CI — 测试覆盖充分：大重构等价覆盖，新功能端到端测试 → issue 缺少 device pinning 的有效覆盖
device pinning 是新运行时行为，但没有覆盖 device_id>=0 的 CUDA/ROCm 后台线程测试。

Strengths

device_id 从 RemoteRpcServer/PyWrappedModel 传递到 cache store、Messager、TCP closure 和 CacheStoreAsyncWriter，后台 cache store 路径的设备语义更明确。
PendingTaskGuard 让后台任务抛异常时也能递减 pending_count_，降低 waitAllDone 永久等待风险。

wht21 · 2026-06-22T03:54:05Z

internal source has been updated, please review the changes!

wht21 · 2026-06-22T09:34:22Z

internal source has been updated, please review the changes!

LLLLKKKK · 2026-06-25T07:29:29Z

AI Code Review - PR #1116

Status: LGTM

Summary: P0/0 · P1/0 · P2/0 · P3/2

lgtm ready to ci

Non-blocking Suggestions

P3

PyWrappedModel.h 中 int64_t 到 int 的隐式窄化转换 @ rtp_llm/cpp/models/PyWrappedModel.h:275
- 建议：与 RemoteRpcServer.cc 保持一致，加上 static_cast：CacheStoreAsyncWriter(static_cast<int>(params.parallelism_config.local_rank))
InitParams.h 文件末尾仍缺少换行符 @ rtp_llm/cpp/disaggregate/cache_store/InitParams.h:53
- 建议：在文件末尾 } // namespace rtp_llm 后添加换行符，与 PR 中其他文件的清理保持一致。

Checklist ✅ (39 items passed)

Strengths

PendingTaskGuard RAII 模式替代手动 fetch_sub，确保异常路径下 pending_count_ 始终正确递减，提升了代码健壮性
pinThreadToDeviceOnce 使用 thread_local 缓存已 pin 设备，避免重复 cudaSetDevice 调用，对线程池场景高效
device_id 从 RemoteRpcServer/PyWrappedModel 到 NormalCacheStore、TcpMessager、TcpCacheStoreServiceImpl、TcpTransferConnection、Closure 的传播路径完整一致
所有新增 device_id 参数均默认 -1（不 pin），保持完全向后兼容

wht21 · 2026-06-25T07:32:56Z

internal source has been updated, please review the changes!

wht21 · 2026-06-25T09:45:25Z

internal source has been updated, please review the changes!

LLLLKKKK · 2026-06-29T02:34:21Z

AI Code Review - PR #1116

Status: LGTM

Summary: P0/0 · P1/0 · P2/1 · P3/1

lgtm ready to ci

Non-blocking Suggestions

P2

DevicePin.h 新工具函数无测试覆盖 @ rtp_llm/cpp/utils/DevicePin.h:15
- 建议：添加 CPU 侧单测验证 thread_local 缓存逻辑（device_id=-1 时 early return、相同 device_id 重复调用不重入）；CacheStoreAsyncWriterTest 中新增传入特定 device_id 值的用例，断言 writer.device_id_ 存储正确。GPU 侧行为通过 smoke 测试覆盖。

P3

PyWrappedModel.h 隐式窄化转换 int64_t → int @ rtp_llm/cpp/models/PyWrappedModel.h:275
- 建议：与 RemoteRpcServer.cc 保持一致，使用 static_cast(params.parallelism_config.local_rank)。

Checklist Violations (1 fail / 55 total)

General Principles Checklist

[6.1] Tests — 新逻辑有聚焦单测 + 相关集成/smoke 测试 → issue DevicePin.h 新工具函数无测试覆盖
DevicePin.h pinThreadToDeviceOnce 核心工具函数无独立单元测试。CacheStoreAsyncWriterTest 所有用例使用 device_id=-1 跳过 pinning 逻辑，device_id 参数传播未被验证。

Strengths

PendingTaskGuard RAII 模式确保异常路径下 pending_count_ 正确递减，同时修复了 pushTask 失败路径下 wait_cv_ 未 notify 的潜在竞态
thread_local 缓存使重复 pinThreadToDeviceOnce 调用接近零开销，且跨平台支持 CUDA/ROCm/CPU no-op
所有新增构造函数参数均带 device_id = -1 默认值，保持完全向后兼容
device_id 从 RemoteRpcServer/PyWrappedModel 经 CacheStoreInitParams → MessagerInitParams → 各组件完整传播，覆盖所有 worker 线程入口

wht21 · 2026-06-29T02:37:20Z

internal source has been updated, please review the changes!

Vinkle-hzt requested a review from LLLLKKKK as a code owner June 17, 2026 10:16

Vinkle-hzt added 3 commits June 22, 2026 11:33

fix: pin cache store workers to local device

3d0e78d

chore: remove redundant error handling

3fa401b

feat: add device_id support for TCP connection handling and pinning

5b08374

Vinkle-hzt force-pushed the fix/pin_device branch from cab283e to 5b08374 Compare June 22, 2026 03:35

Vinkle-hzt enabled auto-merge (squash) June 23, 2026 01:59

Merge branch 'main' into fix/pin_device

f1c9bf9

Merge branch 'main' into fix/pin_device

f10acce

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: pin cache store workers to local device#1116

fix: pin cache store workers to local device#1116
Vinkle-hzt wants to merge 5 commits into
alibaba:mainfrom
Vinkle-hzt:fix/pin_device

Vinkle-hzt commented Jun 17, 2026

Uh oh!

LLLLKKKK commented Jun 17, 2026

Uh oh!

LLLLKKKK commented Jun 18, 2026

Uh oh!

LLLLKKKK commented Jun 22, 2026

Uh oh!

LLLLKKKK commented Jun 22, 2026

Uh oh!

wht21 commented Jun 22, 2026

Uh oh!

wht21 commented Jun 22, 2026

Uh oh!

LLLLKKKK commented Jun 25, 2026

Uh oh!

wht21 commented Jun 25, 2026

Uh oh!

wht21 commented Jun 25, 2026

Uh oh!

LLLLKKKK commented Jun 29, 2026

Uh oh!

wht21 commented Jun 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

Vinkle-hzt commented Jun 17, 2026

Summary

Changes

Testing

Uh oh!

LLLLKKKK commented Jun 17, 2026

AI Code Review - PR #1116

Non-blocking Suggestions

P2

Checklist Violations (6 fail / 34 total)

Strengths

Uh oh!

LLLLKKKK commented Jun 18, 2026

AI Code Review - PR #1116

Blocking Issues

P1

Non-blocking Suggestions

P2

Checklist Violations (3 fail / 46 total)

Strengths

Uh oh!

LLLLKKKK commented Jun 22, 2026

AI Code Review - PR #1116

Blocking Issues

P1

Non-blocking Suggestions

P2

Checklist Violations (6 fail / 38 total)

Strengths

Uh oh!

LLLLKKKK commented Jun 22, 2026

AI Code Review - PR #1116

Non-blocking Suggestions

P2

Checklist Violations (3 fail / 82 total)

Strengths

Uh oh!

wht21 commented Jun 22, 2026

Uh oh!

wht21 commented Jun 22, 2026

Uh oh!

LLLLKKKK commented Jun 25, 2026

AI Code Review - PR #1116

Non-blocking Suggestions

P3

Checklist ✅ (39 items passed)

Strengths

Uh oh!

wht21 commented Jun 25, 2026

Uh oh!

wht21 commented Jun 25, 2026

Uh oh!

LLLLKKKK commented Jun 29, 2026

AI Code Review - PR #1116

Non-blocking Suggestions

P2

P3

Checklist Violations (1 fail / 55 total)

Strengths

Uh oh!

wht21 commented Jun 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants