[kubernetes] Pods with wrong image getting stuck forever

## 🐛 Bug



Module (check all that apply):
 * [ ] `torchx.spec`
 * [ ] `torchx.component`
 * [ ] `torchx.apps`
 * [ ] `torchx.runtime`
 * [ ] `torchx.cli`
 * [x] `torchx.schedulers`
 * [ ] `torchx.pipelines`
 * [ ] `torchx.aws`
 * [ ] `torchx.examples`
 * [ ] `other`


When a job is started with an image that does not exist, it gets stuck forever. We need to provide a better experience by propagating such errors to users via torchx.

Repro:

    torchx run --scheduler kubernetes --scheduler_args namespace=default,queue=default examples/apps/dist_cifar/component.py:trainer --image dummy_image --rdzv_backend=etcd-v2 --rdzv_endpoint=etcd-server:2379 --nnodes 2 -- --epochs 1 --output_path s3://torchx-test/aivanou


    torchx status $job


The second command always reports the job as pending.
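
One possible direction, sketched under assumptions: when an image cannot be pulled, Kubernetes normally keeps the pod in the `Pending` phase and records a waiting reason such as `ErrImagePull` or `ImagePullBackOff` on the container status. The helper below inspects a pod object in its plain dict (JSON) form and returns an error string that a scheduler could surface instead of a bare pending state. The name `find_image_pull_error` is hypothetical and not part of torchx; how it would be wired into the kubernetes scheduler is left open.

```python
from typing import Any, Dict, Optional

# Waiting reasons Kubernetes reports when a container image cannot be pulled.
IMAGE_PULL_FAILURE_REASONS = {"ErrImagePull", "ImagePullBackOff", "InvalidImageName"}


def find_image_pull_error(pod: Dict[str, Any]) -> Optional[str]:
    """Return a readable error if a container in the pod is stuck on an
    image pull failure, otherwise None.

    `pod` is assumed to be a Pod object in dict form (the JSON returned by
    the Kubernetes API). This helper is hypothetical, not a torchx API.
    """
    status = pod.get("status") or {}
    # Check init containers as well as regular containers.
    container_statuses = (status.get("initContainerStatuses") or []) + (
        status.get("containerStatuses") or []
    )
    for container in container_statuses:
        waiting = (container.get("state") or {}).get("waiting") or {}
        reason = waiting.get("reason")
        if reason in IMAGE_PULL_FAILURE_REASONS:
            error = f"container {container.get('name')!r}: {reason}"
            message = waiting.get("message")
            if message:
                error += f": {message}"
            return error
    return None


# Example pod status, shaped like what the API reports for a missing image.
example_pod = {
    "status": {
        "phase": "Pending",
        "containerStatuses": [
            {
                "name": "trainer",
                "image": "dummy_image",
                "state": {"waiting": {"reason": "ImagePullBackOff"}},
            }
        ],
    }
}
print(find_image_pull_error(example_pod))
```

With a check like this, `torchx status` could report a failure together with the reason rather than staying at pending indefinitely. Whether an image pull error should fail the job immediately or only after a grace period is a design choice this sketch does not settle, since `ImagePullBackOff` can also be transient (for example, a registry outage).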
