Describe the bug
When a code error occurs and the dlrover master fails, some workers keep running. Each running worker then repeats this error message:
[2026-03-12 14:50:48,019] [ERROR] [function_util.py:114:wrapped] <_InactiveRpcError of RPC that terminated with:
To Reproduce
Steps to reproduce the unexpected case:
- Create a job in which some ranks hit a runtime error.
Expected behavior
When the dlrover master fails, the remaining running and pending pods should be stopped.
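To make the expected behavior concrete, here is a minimal sketch of a worker-side watchdog that gives up after several consecutive failed calls to the master instead of logging the same RPC error forever. This is not dlrover code: `watch_master`, `ping_master`, `MasterUnreachableError`, and the thresholds are hypothetical names chosen for illustration, and `ConnectionError` stands in for whatever RPC error the real client raises.

```python
import time
from typing import Callable, Optional


class MasterUnreachableError(RuntimeError):
    """Hypothetical error: the master failed too many consecutive health checks."""


def watch_master(
    ping_master: Callable[[], None],
    max_failures: int = 3,
    interval_s: float = 5.0,
    max_checks: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll the master and raise once `max_failures` pings fail in a row.

    `ping_master` is an assumed callable that raises ConnectionError when
    the master cannot be reached. `max_checks` bounds the loop (useful for
    tests); with None the loop runs until the master is declared dead.
    Returns the number of checks performed if `max_checks` is reached.
    """
    failures = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        checks += 1
        try:
            ping_master()
            failures = 0  # any successful ping resets the failure streak
        except ConnectionError as exc:
            failures += 1
            if failures >= max_failures:
                # Stop retrying so the caller can shut the worker down.
                raise MasterUnreachableError(
                    f"master unreachable after {failures} consecutive checks"
                ) from exc
        sleep(interval_s)
    return checks
```

A worker could run this in a background thread and exit with a non-zero status when `MasterUnreachableError` is raised, which would let the pod terminate rather than keep running without a master.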
APP Info (please complete the following information):