(TF) estimator.attach() doesn't override latest_training_job for deployment #1438

Open
@athewsey

Describe the bug

When an estimator is attach()ed to a training job older than the most recent job started by .fit() on that same estimator object, calling .deploy() does not deploy the attached job as expected. Instead, it tries to read artifacts from the more recent (perhaps still in-progress) job, which can trigger an error.

This issue is probably only relevant to interactive sessions, e.g. notebooks.

To reproduce

  • Create a TensorFlow estimator.
  • Prepare at least one successful historical training job that the estimator can be attach()ed to and deployed from.
  • Trigger a new training job (.fit()) and return to the interactive shell before that job has finished (e.g. by specifying wait=False, or just using Ctrl+C).
  • estimator.attach() to the previous, completed and valid training job.
  • Call estimator.deploy().
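
The steps above can be simulated without AWS access. The sketch below uses a hypothetical FakeEstimator stand-in, not the SDK class, and the job names are made up. It assumes that attach() builds and returns a new estimator rather than updating the object it is called on. That assumption is not confirmed in this report; if it holds for the real SDK, it would explain why deploy() still sees the newer job.

```python
# Hypothetical stand-in for the estimator; this is not the SageMaker SDK.
class FakeEstimator:
    def __init__(self, latest_training_job=None):
        self.latest_training_job = latest_training_job

    def fit(self, job_name):
        # Simulates fit(wait=False): records the new job and returns at once.
        self.latest_training_job = job_name

    @classmethod
    def attach(cls, job_name):
        # Assumption: attach() returns a new estimator bound to job_name
        # and leaves the calling object untouched.
        return cls(latest_training_job=job_name)

    def deploy(self):
        # Simulates deploy(): uses whatever latest_training_job holds.
        return "deploying " + self.latest_training_job


estimator = FakeEstimator()
estimator.fit("new-job-in-progress")

# Calling attach() on the instance and discarding the result changes nothing.
estimator.attach("old-job-completed")
stale = estimator.deploy()

# Keeping the returned object gives an estimator bound to the older job.
attached = FakeEstimator.attach("old-job-completed")
fresh = attached.deploy()

print(stale)
print(fresh)
```

Under that assumption, a possible workaround is to assign the return value of attach() to a variable and call deploy() on that object instead of the original estimator.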

Expected behavior

Because we explicitly attached the estimator to a job after the last time we triggered one, I would expect subsequent deploy() calls to refer to the attach()ed job, regardless of how old that job is relative to the others.

Screenshots or logs (Actual behavior)

Instead, a KeyError is thrown because the estimator looks up model artifacts for self.latest_training_job.name rather than for the attached job, and both attach() and deploy() ran before that most recent job had finished:

KeyError                                  Traceback (most recent call last)
<timed exec> in <module>()

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/estimator.py in deploy(self, initial_instance_count, instance_type, accelerator_type, endpoint_name, use_compiled_model, update_endpoint, wait, model_name, kms_key, data_capture_config, tags, **kwargs)
    685         else:
    686             kwargs["model_kms_key"] = self.output_kms_key
--> 687             model = self.create_model(**kwargs)
    688 
    689         model.name = model_name

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/tensorflow/estimator.py in create_model(self, model_server_workers, role, vpc_config_override, endpoint_type, entry_point, source_dir, dependencies, **kwargs)
    592                 source_dir=source_dir,
    593                 dependencies=dependencies,
--> 594                 **kwargs
    595             )
    596 

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/tensorflow/estimator.py in _create_tfs_model(self, role, vpc_config_override, entry_point, source_dir, dependencies, **kwargs)
    623 
    624         return Model(
--> 625             model_data=self.model_data,
    626             role=role,
    627             image=(image or self.image_name),

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/estimator.py in model_data(self)
    709             model_uri = self.sagemaker_session.sagemaker_client.describe_training_job(
    710                 TrainingJobName=self.latest_training_job.name
--> 711             )["ModelArtifacts"]["S3ModelArtifacts"]
    712         else:
    713             logging.warning(

KeyError: 'ModelArtifacts'
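
The failing lookup in the last frame can be reproduced with plain dictionaries. This is a minimal sketch: the two responses are hypothetical and trimmed to the relevant keys, and it assumes, in line with the traceback, that the description of an unfinished job has no "ModelArtifacts" entry. The model_uri_checked helper is an illustrative variant, not SDK code.

```python
# Hypothetical describe_training_job responses, trimmed to the relevant keys.
in_progress = {"TrainingJobName": "new-job", "TrainingJobStatus": "InProgress"}
completed = {
    "TrainingJobName": "old-job",
    "TrainingJobStatus": "Completed",
    "ModelArtifacts": {
        "S3ModelArtifacts": "s3://example-bucket/old-job/output/model.tar.gz"
    },
}


def model_uri(description):
    # Same lookup as in the traceback: assumes the key is always present.
    return description["ModelArtifacts"]["S3ModelArtifacts"]


def model_uri_checked(description):
    # Hypothetical defensive variant: raise a descriptive error instead.
    artifacts = description.get("ModelArtifacts")
    if artifacts is None:
        raise ValueError(
            "Training job {} has no model artifacts yet (status: {})".format(
                description["TrainingJobName"], description["TrainingJobStatus"]
            )
        )
    return artifacts["S3ModelArtifacts"]


try:
    model_uri(in_progress)
    raised = None
except KeyError as err:
    raised = err

print(repr(raised))
print(model_uri_checked(completed))
```

A check like this would not fix which job deploy() uses, but it would replace the bare KeyError with a message that names the job and its status.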

System information

  • SageMaker Python SDK version: 1.55.3 (SageMaker conda_python3 kernel)
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans): TensorFlow
  • Framework version: 1.12
  • Python version: 3.6.5 (SageMaker conda_python3 kernel)
  • CPU or GPU: CPU
  • Custom Docker image (Y/N): N
