[kubernetes] Pods with wrong image getting stuck forever #207

Open
@aivanou

Description

🐛 Bug

Module (check all that apply):

  • torchx.spec
  • torchx.component
  • torchx.apps
  • torchx.runtime
  • torchx.cli
  • torchx.schedulers
  • torchx.pipelines
  • torchx.aws
  • torchx.examples
  • other

When a job is started with an image that does not exist, it gets stuck forever. We need to provide a better experience by propagating such errors to users via torchx.

Repro:

torchx run --scheduler kubernetes --scheduler_args namespace=default,queue=default examples/apps/dist_cifar/component.py:trainer --image dummy_image --rdzv_backend=etcd-v2 --rdzv_endpoint=etcd-server:2379 --nnodes 2 -- --epochs 1 --output_path s3://torchx-test/aivanou


torchx status $job

The second command always reports the job as pending.
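
One possible direction, sketched here only as an illustration: when an image cannot be pulled, Kubernetes keeps the pod in the Pending phase but records a waiting reason such as `ErrImagePull` or `ImagePullBackOff` on the container status. A status check could inspect those reasons and report them instead of a bare "pending". The helper name `find_image_pull_error` and the sample pod dictionary below are hypothetical and are not part of torchx; the dictionary only mimics the JSON shape of a pod's `status` field.

```python
from typing import Any, Dict, Optional

# Waiting reasons Kubernetes reports when a container image cannot be pulled.
IMAGE_PULL_ERRORS = {"ErrImagePull", "ImagePullBackOff", "InvalidImageName"}


def find_image_pull_error(pod: Dict[str, Any]) -> Optional[str]:
    """Return a readable message if a container is stuck on an image pull, else None.

    Hypothetical helper: `pod` is assumed to be the JSON form of a Kubernetes pod.
    """
    status = pod.get("status") or {}
    container_statuses = (status.get("initContainerStatuses") or []) + (
        status.get("containerStatuses") or []
    )
    for cs in container_statuses:
        waiting = (cs.get("state") or {}).get("waiting") or {}
        if waiting.get("reason") in IMAGE_PULL_ERRORS:
            return f"{cs.get('name')}: {waiting['reason']}: {waiting.get('message', '')}"
    return None


# Hand-written sample mimicking a pod that was started with a nonexistent image.
pod = {
    "status": {
        "phase": "Pending",
        "containerStatuses": [
            {
                "name": "trainer-0",
                "image": "dummy_image",
                "state": {
                    "waiting": {
                        "reason": "ImagePullBackOff",
                        "message": 'Back-off pulling image "dummy_image"',
                    }
                },
            }
        ],
    }
}

print(find_image_pull_error(pod))
```

A scheduler integration could map a non-`None` result to a failed state with the message attached, so that `torchx status` shows why the job is not progressing. Whether that is the right place for the check is left open here.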

Labels

  • bug (Something isn't working)
  • enhancement (New feature or request)
  • kubernetes (kubernetes and volcano schedulers)
  • module: runner (issues related to the torchx.runner and torchx.scheduler modules)
