Description
Describe the bug
I have followed virtually all available guidance on experiment tracking for a training job in "bring your own script" mode, but SageMaker still always creates a separate run for the training job. So for each run I end up with two runs: the one I initialize, to which all the metrics are logged, and a second run, 'pytorch-training--aws-training-job', which contains the output model artifacts and the debug info.
To reproduce
Relevant excerpt from the code that initiates the training job:
```python
from sagemaker.pytorch import PyTorch
from sagemaker.experiments import Run

experiment_name = 'test_experiment'
run_name = 'test_run'

# role and hyperparameters are defined elsewhere
with Run(experiment_name=experiment_name, run_name=run_name) as run:
    est = PyTorch(
        entry_point="./job.py",
        role=role,
        model_dir=False,
        framework_version="2.2",
        py_version="py310",
        instance_type="ml.g5.12xlarge",
        instance_count=1,
        hyperparameters=hyperparameters,
    )
    est.fit()
```
Relevant excerpt from job.py:
```python
import boto3

if __name__ == "__main__":
    from sagemaker.session import Session
    from sagemaker.experiments.run import load_run

    # args, job_name, and execute are defined elsewhere in the script
    session = Session(boto3.session.Session(region_name='us-west-2'))
    with load_run(sagemaker_session=session) as run:
        # Log all parameters
        run.log_parameters({k: str(v) for k, v in vars(args).items()})
        run.log_parameter('job_name', str(job_name))
        execute(args, run)
```
Expected behavior
The experiment config passed to the estimator should contain the experiment and the run I initialized, and the training job should be associated with that run rather than with a new, auto-created one.
System information
A description of your system. Please provide:
- SageMaker Python SDK version: 2.212.0
- Framework name (eg. PyTorch) or algorithm (eg. KMeans): PyTorch
- Framework version: 2.2
- Python version: 3.10
- CPU or GPU: GPU
- Custom Docker image (Y/N): N
Additional context
I've tried virtually everything, including manually passing the experiment config to the estimator.
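Roughly, that manual attempt looked like the sketch below (the dict keys reflect my understanding of the experiment_config argument accepted by fit(); the values are the same placeholders as in the snippet above):

```python
# Sketch of manually passing the experiment config to fit().
# Key names are my best understanding of the experiment_config dict;
# "RunName" in particular is an assumption on my part.
est.fit(
    experiment_config={
        "ExperimentName": experiment_name,
        "RunName": run_name,
    }
)
```

This still produced the same duplicate-run behavior for me.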