Commit abbc942

DevakiBolleneni authored
[PyTorch][Training][EC2][SageMaker]PyTorch 2.9 Currency Release (#5407)
* Add docker files to PT2.9 training
* removed the pins and updated versions
* fixed the pins in cpu file as well
* Modified the Buildspec files and toml file
* Removed fastai temporarily
* rebuilding after pinning opencv-python
* rebuild with updated base image
* corrected base image and few typos
* adding additional dependency for TE 2.8
* Enable efa log and modify the license file
* Modify the license file
* Add pt2.9 ec2 test file
* fix typo and enable host networking
* fix formatting and skip test_fused_attn.py
* Fix formatting in common_cases.py
* Fix EFA NCCL failure
* Fix EFA NCCL failure
* Fix the script to detect actual network interface
* update prbase image and revert back the NCCL changes
* modify the ofi-nccl path
* build sm image
* add fastai and update TE version
* rebuild ec2 image with fastai
* rebuild sm image and test
* update base image and flashattention wheel
* rebuild sm image with enabled security tests
* rebuild ec2 image
* rerun jobs after deleting AML2_CPU_ARM64_US_EAST_1
* rerun jobs after disabling safety check test and ecr scan allowlist
* update MAX_JOBS and try rebuild
* rebuild ec2 image with safety check test and ecr scan allowlist
* rebuild ec2 image and run tests
* rebuild sm image and run tests
* skip smppy tests and rerun
* rerun after enabling safety check test and ecr scan allowlist
* rebuild ec2 image
* fix formatting
* Rerun SM tests
* Revert testEFA changes and run
* Revert toml file

---------

Co-authored-by: DevakiBolleneni <[email protected]>
1 parent 0d571db commit abbc942

10 files changed: +915 -3 lines changed

Lines changed: 75 additions & 0 deletions
@@ -0,0 +1,75 @@
+account_id: &ACCOUNT_ID <set-$ACCOUNT_ID-in-environment>
+prod_account_id: &PROD_ACCOUNT_ID 763104351884
+region: &REGION <set-$REGION-in-environment>
+framework: &FRAMEWORK pytorch
+version: &VERSION 2.9.0
+short_version: &SHORT_VERSION "2.9"
+arch_type: x86
+# autopatch_build: "True"
+
+repository_info:
+  training_repository: &TRAINING_REPOSITORY
+    image_type: &TRAINING_IMAGE_TYPE training
+    root: !join [ *FRAMEWORK, "/", *TRAINING_IMAGE_TYPE ]
+    repository_name: &REPOSITORY_NAME !join [ pr, "-", *FRAMEWORK, "-", *TRAINING_IMAGE_TYPE ]
+    repository: &REPOSITORY !join [ *ACCOUNT_ID, .dkr.ecr., *REGION, .amazonaws.com/, *REPOSITORY_NAME ]
+    release_repository_name: &RELEASE_REPOSITORY_NAME !join [ *FRAMEWORK, "-", *TRAINING_IMAGE_TYPE ]
+    release_repository: &RELEASE_REPOSITORY !join [ *PROD_ACCOUNT_ID, .dkr.ecr., *REGION, .amazonaws.com/, *RELEASE_REPOSITORY_NAME ]
+
+context:
+  training_context: &TRAINING_CONTEXT
+    start_cuda_compat:
+      source: docker/build_artifacts/start_cuda_compat.sh
+      target: start_cuda_compat.sh
+    dockerd_entrypoint:
+      source: docker/build_artifacts/dockerd_entrypoint.sh
+      target: dockerd_entrypoint.sh
+    changehostname:
+      source: docker/build_artifacts/changehostname.c
+      target: changehostname.c
+    start_with_right_hostname:
+      source: docker/build_artifacts/start_with_right_hostname.sh
+      target: start_with_right_hostname.sh
+    example_mnist_file:
+      source: docker/build_artifacts/mnist.py
+      target: mnist.py
+    deep_learning_container:
+      source: ../../src/deep_learning_container.py
+      target: deep_learning_container.py
+    setup_oss_compliance:
+      source: ../../scripts/setup_oss_compliance.sh
+      target: setup_oss_compliance.sh
+
+images:
+  BuildEC2CPUPTTrainPy3DockerImage:
+    <<: *TRAINING_REPOSITORY
+    build: &PYTORCH_CPU_TRAINING_PY3 false
+    image_size_baseline: 7200
+    device_type: &DEVICE_TYPE cpu
+    python_version: &DOCKER_PYTHON_VERSION py3
+    tag_python_version: &TAG_PYTHON_VERSION py312
+    os_version: &OS_VERSION ubuntu22.04
+    tag: !join [ *VERSION, "-", *DEVICE_TYPE, "-", *TAG_PYTHON_VERSION, "-", *OS_VERSION, "-ec2" ]
+    latest_release_tag: !join [ *VERSION, "-", *DEVICE_TYPE, "-", *TAG_PYTHON_VERSION, "-", *OS_VERSION, "-ec2" ]
+    # skip_build: "False"
+    docker_file: !join [ docker/, *SHORT_VERSION, /, *DOCKER_PYTHON_VERSION, /Dockerfile., *DEVICE_TYPE ]
+    target: ec2
+    context:
+      <<: *TRAINING_CONTEXT
+  BuildEC2GPUPTTrainPy3cu130DockerImage:
+    <<: *TRAINING_REPOSITORY
+    build: &PYTORCH_GPU_TRAINING_PY3 false
+    image_size_baseline: 28000
+    device_type: &DEVICE_TYPE gpu
+    python_version: &DOCKER_PYTHON_VERSION py3
+    tag_python_version: &TAG_PYTHON_VERSION py312
+    cuda_version: &CUDA_VERSION cu130
+    os_version: &OS_VERSION ubuntu22.04
+    tag: !join [ *VERSION, "-", *DEVICE_TYPE, "-", *TAG_PYTHON_VERSION, "-", *CUDA_VERSION, "-", *OS_VERSION, "-ec2" ]
+    latest_release_tag: !join [ *VERSION, "-", *DEVICE_TYPE, "-", *TAG_PYTHON_VERSION, "-", *CUDA_VERSION, "-", *OS_VERSION, "-ec2" ]
+    # skip_build: "False"
+    docker_file: !join [ docker/, *SHORT_VERSION, /, *DOCKER_PYTHON_VERSION, /, *CUDA_VERSION, /Dockerfile.,
+                         *DEVICE_TYPE ]
+    target: ec2
+    context:
+      <<: *TRAINING_CONTEXT
Lines changed: 75 additions & 0 deletions
@@ -0,0 +1,75 @@
+account_id: &ACCOUNT_ID <set-$ACCOUNT_ID-in-environment>
+prod_account_id: &PROD_ACCOUNT_ID 763104351884
+region: &REGION <set-$REGION-in-environment>
+framework: &FRAMEWORK pytorch
+version: &VERSION 2.9.0
+short_version: &SHORT_VERSION "2.9"
+arch_type: x86
+# autopatch_build: "True"
+
+repository_info:
+  training_repository: &TRAINING_REPOSITORY
+    image_type: &TRAINING_IMAGE_TYPE training
+    root: !join [ *FRAMEWORK, "/", *TRAINING_IMAGE_TYPE ]
+    repository_name: &REPOSITORY_NAME !join [ pr, "-", *FRAMEWORK, "-", *TRAINING_IMAGE_TYPE ]
+    repository: &REPOSITORY !join [ *ACCOUNT_ID, .dkr.ecr., *REGION, .amazonaws.com/, *REPOSITORY_NAME ]
+    release_repository_name: &RELEASE_REPOSITORY_NAME !join [ *FRAMEWORK, "-", *TRAINING_IMAGE_TYPE ]
+    release_repository: &RELEASE_REPOSITORY !join [ *PROD_ACCOUNT_ID, .dkr.ecr., *REGION, .amazonaws.com/, *RELEASE_REPOSITORY_NAME ]
+
+context:
+  training_context: &TRAINING_CONTEXT
+    start_cuda_compat:
+      source: docker/build_artifacts/start_cuda_compat.sh
+      target: start_cuda_compat.sh
+    dockerd_entrypoint:
+      source: docker/build_artifacts/dockerd_entrypoint.sh
+      target: dockerd_entrypoint.sh
+    changehostname:
+      source: docker/build_artifacts/changehostname.c
+      target: changehostname.c
+    start_with_right_hostname:
+      source: docker/build_artifacts/start_with_right_hostname.sh
+      target: start_with_right_hostname.sh
+    example_mnist_file:
+      source: docker/build_artifacts/mnist.py
+      target: mnist.py
+    deep_learning_container:
+      source: ../../src/deep_learning_container.py
+      target: deep_learning_container.py
+    setup_oss_compliance:
+      source: ../../scripts/setup_oss_compliance.sh
+      target: setup_oss_compliance.sh
+
+images:
+  BuildSageMakerCPUPTTrainPy3DockerImage:
+    <<: *TRAINING_REPOSITORY
+    build: &PYTORCH_CPU_TRAINING_PY3 false
+    image_size_baseline: 7200
+    device_type: &DEVICE_TYPE cpu
+    python_version: &DOCKER_PYTHON_VERSION py3
+    tag_python_version: &TAG_PYTHON_VERSION py312
+    os_version: &OS_VERSION ubuntu22.04
+    tag: !join [ *VERSION, "-", *DEVICE_TYPE, "-", *TAG_PYTHON_VERSION, "-", *OS_VERSION, "-sagemaker" ]
+    latest_release_tag: !join [ *VERSION, "-", *DEVICE_TYPE, "-", *TAG_PYTHON_VERSION, "-", *OS_VERSION, "-sagemaker" ]
+    # skip_build: "False"
+    docker_file: !join [ docker/, *SHORT_VERSION, /, *DOCKER_PYTHON_VERSION, /Dockerfile., *DEVICE_TYPE ]
+    target: sagemaker
+    context:
+      <<: *TRAINING_CONTEXT
+  BuildSageMakerGPUPTTrainPy3DockerImage:
+    <<: *TRAINING_REPOSITORY
+    build: &PYTORCH_GPU_TRAINING_PY3 false
+    image_size_baseline: 28000
+    device_type: &DEVICE_TYPE gpu
+    python_version: &DOCKER_PYTHON_VERSION py3
+    tag_python_version: &TAG_PYTHON_VERSION py312
+    cuda_version: &CUDA_VERSION cu130
+    os_version: &OS_VERSION ubuntu22.04
+    tag: !join [ *VERSION, "-", *DEVICE_TYPE, "-", *TAG_PYTHON_VERSION, "-", *CUDA_VERSION, "-", *OS_VERSION, "-sagemaker" ]
+    latest_release_tag: !join [ *VERSION, "-", *DEVICE_TYPE, "-", *TAG_PYTHON_VERSION, "-", *CUDA_VERSION, "-", *OS_VERSION, "-sagemaker" ]
+    # skip_build: "False"
+    docker_file: !join [ docker/, *SHORT_VERSION, /, *DOCKER_PYTHON_VERSION, /, *CUDA_VERSION, /Dockerfile.,
+                         *DEVICE_TYPE ]
+    target: sagemaker
+    context:
+      <<: *TRAINING_CONTEXT
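
As a reading aid for the `tag` and `docker_file` entries in the two buildspecs: the `!join` tag appears to concatenate its list items into one string. The Python sketch below assumes that plain-concatenation behavior. The `join`, `image_tag`, and `docker_file` helpers are hypothetical names introduced for illustration and are not taken from the repository's build tooling; the constant values are copied from the buildspecs.

```python
# Hypothetical stand-in for the buildspec's `!join` tag, assumed to
# concatenate its items with no separator.
def join(*parts):
    return "".join(str(p) for p in parts)

# Values copied from the buildspecs in this commit.
VERSION = "2.9.0"
SHORT_VERSION = "2.9"
PYTHON_VERSION = "py3"
TAG_PYTHON_VERSION = "py312"
CUDA_VERSION = "cu130"
OS_VERSION = "ubuntu22.04"

def image_tag(device_type, target):
    """Build an image tag; GPU tags carry the CUDA version, CPU tags do not."""
    parts = [VERSION, "-", device_type, "-", TAG_PYTHON_VERSION, "-"]
    if device_type == "gpu":
        parts += [CUDA_VERSION, "-"]
    parts += [OS_VERSION, "-", target]
    return join(*parts)

def docker_file(device_type):
    """Build the Dockerfile path; GPU Dockerfiles sit in a CUDA subdirectory."""
    parts = ["docker/", SHORT_VERSION, "/", PYTHON_VERSION]
    if device_type == "gpu":
        parts += ["/", CUDA_VERSION]
    parts += ["/Dockerfile.", device_type]
    return join(*parts)

if __name__ == "__main__":
    for target in ("ec2", "sagemaker"):
        for device in ("cpu", "gpu"):
            print(image_tag(device, target), docker_file(device))
```

Both image entries in each buildspec redefine anchors such as `&DEVICE_TYPE`. Under YAML's rules an alias resolves to the most recent preceding definition of its anchor, which is why `*DEVICE_TYPE` in the GPU entry yields `gpu` rather than `cpu`.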

pytorch/training/buildspec.yml

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-buildspec_pointer: buildspec-2-8-sm.yml
+buildspec_pointer: buildspec-2-9-ec2.yml
