🐛 Bug
Module (check all that apply):
- torchx.spec
- torchx.component
- torchx.apps
- torchx.runtime
- torchx.cli
- torchx.schedulers
- torchx.pipelines
- torchx.aws
- torchx.examples
- other
When a job is started with an image that does not exist, it stays stuck in the pending state indefinitely. We need to provide a better experience by propagating such errors to users via torchx.
Repro:
torchx run --scheduler kubernetes --scheduler_args namespace=default,queue=default examples/apps/dist_cifar/component.py:trainer --image dummy_image --rdzv_backend=etcd-v2 --rdzv_endpoint=etcd-server:2379 --nnodes 2 -- --epochs 1 --output_path s3://torchx-test/aivanou
torchx status $job
The second command always reports the job as pending, no matter how long you wait.
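One possible direction, sketched below as an assumption rather than a description of how torchx works today: Kubernetes records why a container is waiting in the pod status (`status.containerStatuses[].state.waiting.reason`), and reasons such as `ErrImagePull`, `ImagePullBackOff`, and `InvalidImageName` mean the image cannot be pulled. A status check could look for these reasons and report an error instead of pending. The helper name `find_image_pull_error` and the sample pod dict are hypothetical and only illustrate the idea; the function takes a pod object as a plain dict (for example, parsed from `kubectl get pod -o json`).

```python
from typing import Any, Dict, Optional

# Kubernetes container waiting reasons that indicate the image cannot be pulled.
IMAGE_PULL_FAILURE_REASONS = {"ErrImagePull", "ImagePullBackOff", "InvalidImageName"}


def find_image_pull_error(pod: Dict[str, Any]) -> Optional[str]:
    """Hypothetical helper: return a readable error if any container in the
    pod is waiting because its image cannot be pulled, otherwise None."""
    status = pod.get("status") or {}
    container_statuses = (status.get("initContainerStatuses") or []) + (
        status.get("containerStatuses") or []
    )
    for container in container_statuses:
        waiting = (container.get("state") or {}).get("waiting")
        if waiting and waiting.get("reason") in IMAGE_PULL_FAILURE_REASONS:
            name = container.get("name", "<unknown>")
            message = waiting.get("message", "")
            return f"container {name}: {waiting['reason']}: {message}".rstrip(": ")
    return None


# Illustrative pod status for a job started with `--image dummy_image`.
# The container name and message are made up for this example.
stuck_pod = {
    "status": {
        "phase": "Pending",
        "containerStatuses": [
            {
                "name": "trainer-0",
                "state": {
                    "waiting": {
                        "reason": "ImagePullBackOff",
                        "message": 'Back-off pulling image "dummy_image"',
                    }
                },
            }
        ],
    }
}

error = find_image_pull_error(stuck_pod)
if error is not None:
    print(f"job failed to start: {error}")
```

With a check like this, `torchx status $job` could surface the image pull error to the user rather than showing pending indefinitely. Where such a check would live in the Kubernetes scheduler is left open here.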