Dlrover master fail with worker running #1710

@ljye2023

Describe the bug
When a code error occurs and the DLRover master fails, some workers keep running. The running workers repeat this error message:

[2026-03-12 14:50:48,019] [ERROR] [function_util.py:114:wrapped] <_InactiveRpcError of RPC that terminated with:

To Reproduce
Steps to reproduce the unexpected case:

  • Create a job in which some ranks hit a runtime error.

Expected behavior
When the DLRover master fails, the remaining running and pending pods should be stopped.
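
To illustrate the behavior I would expect on the worker side, here is a rough sketch, not DLRover code. `MasterWatchdog`, `run_worker`, `ping`, and `max_failures` are all hypothetical names made up for this example; `ping` stands in for whatever RPC the worker already sends to the master. The idea is that a worker stops itself after several consecutive failed calls instead of retrying forever.

```python
from typing import Callable


class MasterWatchdog:
    """Hypothetical helper: counts consecutive failed pings to the master."""

    def __init__(self, ping: Callable[[], None], max_failures: int = 3):
        self._ping = ping                  # stand-in for the real master RPC
        self._max_failures = max_failures  # assumed threshold, not a DLRover setting
        self._failures = 0

    def check(self) -> bool:
        """Ping once; return True if the worker should keep running."""
        try:
            self._ping()
        except Exception:
            # Any failure (e.g. an RPC error) counts against the master.
            self._failures += 1
        else:
            # A successful ping resets the counter.
            self._failures = 0
        return self._failures < self._max_failures


def run_worker(watchdog: MasterWatchdog, step: Callable[[], None], max_steps: int) -> int:
    """Run up to max_steps steps; stop early once the master looks dead."""
    done = 0
    for _ in range(max_steps):
        if not watchdog.check():
            break  # master unreachable too many times in a row: stop this worker
        step()
        done += 1
    return done
```

A real fix might instead live in the operator or job controller, which could delete the remaining pods when the master pod fails; the sketch only shows the worker-side half of that idea.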

APP Info (please complete the following information):

  • DLRover: [0.6.1]

Labels

report, todo (issue or pr with 'todo' will ignore expiration)
