Description
Describe the bug
After calling estimator.attach() to attach to a training job older than the most recent job kicked off by .fit() on that same estimator object, calling .deploy() does not deploy the attached job as expected. Instead it tries to read artifacts from the more recent (perhaps still in-progress) job, which can trigger an error.
This issue is probably only really relevant for interactive session use cases e.g. notebooks.
To reproduce
- Create a TensorFlow estimator
- Prepare at least one successful historical training job that the estimator can be attach()ed to and deploy.
- Trigger a new training job (.fit()) and resume the interactive shell before that job is finished (e.g. by specifying wait=False, or just using Ctrl+C)
- estimator.attach() to the previous, completed and valid training job
- Call estimator.deploy()
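The steps above can be sketched as a small helper. This is a minimal sketch rather than the exact code that was run: the function name reproduce, the inputs argument and the default instance type are assumptions, estimator is assumed to be an already-configured TensorFlow estimator, and completed_job_name is assumed to be the name of the older, successful job.

```python
def reproduce(estimator, inputs, completed_job_name, instance_type="ml.m5.large"):
    """Hypothetical helper that walks through the reproduction steps.

    `estimator` is assumed to be an already-configured TensorFlow estimator;
    `inputs`, `completed_job_name` and `instance_type` are placeholders.
    """
    # Start a new training job and return immediately instead of blocking.
    estimator.fit(inputs, wait=False)

    # Attach to the older, completed job while the new one is still running.
    estimator.attach(completed_job_name)

    # Reported behavior: this raises KeyError: 'ModelArtifacts' because the
    # in-progress job is inspected rather than the attached one.
    return estimator.deploy(initial_instance_count=1, instance_type=instance_type)
```

Running the helper while the newer job is still in progress should surface the error described below; once that job has completed, the same calls may silently deploy the wrong model instead.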
Expected behavior
Because we've explicitly attached the estimator to a job since the last time we triggered one, subsequent deploy() calls should refer to the attach()ed job, regardless of the relative age of the job.
Screenshots or logs (Actual behavior)
In actual fact, a KeyError is thrown because the estimator tries to access model artifacts from self.latest_training_job.name rather than the attached job (and we ran both attach() and deploy() before that most recent job finished):
KeyError                                  Traceback (most recent call last)
<timed exec> in <module>()

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/estimator.py in deploy(self, initial_instance_count, instance_type, accelerator_type, endpoint_name, use_compiled_model, update_endpoint, wait, model_name, kms_key, data_capture_config, tags, **kwargs)
    685         else:
    686             kwargs["model_kms_key"] = self.output_kms_key
--> 687         model = self.create_model(**kwargs)
    688
    689         model.name = model_name

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/tensorflow/estimator.py in create_model(self, model_server_workers, role, vpc_config_override, endpoint_type, entry_point, source_dir, dependencies, **kwargs)
    592                 source_dir=source_dir,
    593                 dependencies=dependencies,
--> 594                 **kwargs
    595             )
    596

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/tensorflow/estimator.py in _create_tfs_model(self, role, vpc_config_override, entry_point, source_dir, dependencies, **kwargs)
    623
    624         return Model(
--> 625             model_data=self.model_data,
    626             role=role,
    627             image=(image or self.image_name),

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/estimator.py in model_data(self)
    709             model_uri = self.sagemaker_session.sagemaker_client.describe_training_job(
    710                 TrainingJobName=self.latest_training_job.name
--> 711             )["ModelArtifacts"]["S3ModelArtifacts"]
    712         else:
    713             logging.warning(

KeyError: 'ModelArtifacts'
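The failing lookup can be illustrated without AWS access. The sketch below uses hand-written dictionaries as stand-ins for describe_training_job responses. The job names, the S3 URI and both helper functions are hypothetical, and the in-progress response is assumed to lack the ModelArtifacts key, as the traceback suggests.

```python
# Hypothetical stand-ins for describe_training_job responses.
IN_PROGRESS_JOB = {
    "TrainingJobName": "newer-job",
    "TrainingJobStatus": "InProgress",
    # Assumed: no "ModelArtifacts" entry until the job has finished.
}
COMPLETED_JOB = {
    "TrainingJobName": "older-job",
    "TrainingJobStatus": "Completed",
    "ModelArtifacts": {
        "S3ModelArtifacts": "s3://example-bucket/older-job/output/model.tar.gz"
    },
}


def model_artifacts_uri(description):
    """Mirror the lookup shown at the bottom of the traceback."""
    return description["ModelArtifacts"]["S3ModelArtifacts"]


def lookup_fails(description):
    """Return True if the lookup raises KeyError for this response."""
    try:
        model_artifacts_uri(description)
    except KeyError:
        return True
    return False
```

With these stand-ins, the lookup fails only for the in-progress response, which matches the report: deploy() fails because it inspects the newer job instead of the attached one.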
System information
- SageMaker Python SDK version: 1.55.3 (SageMaker conda_python3 kernel)
- Framework name (eg. PyTorch) or algorithm (eg. KMeans): TensorFlow
- Framework version: 1.12
- Python version: 3.6.5 (SageMaker conda_python3 kernel)
- CPU or GPU: CPU
- Custom Docker image (Y/N): N