
Commit 0598ea8
Added steps to install slinky on K8s and example training workload
1 parent 0640227 commit 0598ea8
6 files changed: +1157 -0 lines changed

slinky/Readme.md (+139 lines)

# Example Slinky Training Workload on Kubernetes

The following outlines the steps to get up and running with Slinky on Kubernetes and to run a simple image classification training workload that verifies the GPUs are accessible.

## Clone this repo and go into the slinky folder

```bash
git clone https://github.com/amd/ada.git
cd ada/slinky
```

## Installing Slinky Prerequisites

The following steps for installing the prerequisites and Slinky itself are taken from the SlinkyProject/slurm-operator [quick-start guide](https://github.com/SlinkyProject/slurm-operator/blob/main/docs/quickstart.md).

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager --create-namespace --set crds.enabled=true
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace prometheus --create-namespace --set installCRDs=true
```
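
Optionally, confirm the prerequisite charts came up cleanly before installing the operator (namespaces as created by the commands above):

```bash
kubectl --namespace=cert-manager get pods
kubectl --namespace=prometheus get pods
```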

## Installing Slinky Operator

```bash
helm install slurm-operator oci://ghcr.io/slinkyproject/charts/slurm-operator \
  --values=values-operator.yaml --version=0.1.0 --namespace=slinky --create-namespace
```

Make sure the operator deployed successfully with:

```sh
kubectl --namespace=slinky get pods
```

Output should be similar to:

```sh
NAME                                      READY   STATUS    RESTARTS   AGE
slurm-operator-7444c844d5-dpr5h           1/1     Running   0          5m00s
slurm-operator-webhook-6fd8d7857d-zcvqh   1/1     Running   0          5m00s
```
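
If the pods are still starting, you can block until the operator deployments report ready (namespace as created above; the timeout is an arbitrary choice):

```bash
kubectl wait --namespace=slinky --for=condition=Available deployment --all --timeout=300s
```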

## Installing Slurm Cluster

```bash
helm install slurm oci://ghcr.io/slinkyproject/charts/slurm \
  --values=values-slurm.yaml --version=0.1.0 --namespace=slurm --create-namespace
```

Make sure the Slurm cluster deployed successfully with:

```sh
kubectl --namespace=slurm get pods
```

Output should be similar to:

```sh
NAME                              READY   STATUS    RESTARTS   AGE
slurm-accounting-0                1/1     Running   0          5m00s
slurm-compute-gpu-node            1/1     Running   0          5m00s
slurm-controller-0                2/2     Running   0          5m00s
slurm-exporter-7b44b6d856-d86q5   1/1     Running   0          5m00s
slurm-mariadb-0                   1/1     Running   0          5m00s
slurm-restapi-5f75db85d9-67gpl    1/1     Running   0          5m00s
```
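
The Slurm pods can take several minutes to pull images and initialize. A wait command similar to the one above can be used to block until they are ready (timeout is again an arbitrary choice):

```bash
kubectl wait --namespace=slurm --for=condition=Ready pod --all --timeout=600s
```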

## Prepping Compute Node

1. Get the Slurm compute node pod name

```bash
SLURM_COMPUTE_POD=$(kubectl get pods -n slurm | grep ^slurm-compute-gpu-node | awk '{print $1}'); echo $SLURM_COMPUTE_POD
```

2. Add the slurm user to the video and render groups and create the slurm user's home directory on the Slurm compute node

```bash
kubectl exec -it -n slurm $SLURM_COMPUTE_POD -- bash -c "
usermod -aG video,render slurm
mkdir -p /home/slurm
chown slurm:slurm /home/slurm"
```
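
You can verify that the group membership and home directory changes took effect (same pod variable as step 1):

```bash
kubectl exec -n slurm $SLURM_COMPUTE_POD -- bash -c "id slurm && ls -ld /home/slurm"
```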

3. Copy the PyTorch test script to the Slurm compute node

```bash
kubectl cp test.py slurm/$SLURM_COMPUTE_POD:/tmp/test.py
```

4. Copy the Fashion MNIST image classification model training script to the Slurm compute node

```bash
kubectl cp train_fashion_mnist.py slurm/$SLURM_COMPUTE_POD:/tmp/train_fashion_mnist.py
```
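
To confirm both scripts landed where expected on the compute node:

```bash
kubectl exec -n slurm $SLURM_COMPUTE_POD -- ls -l /tmp/test.py /tmp/train_fashion_mnist.py
```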

5. Run the test.py script on the compute node to confirm the GPUs are accessible

```bash
kubectl exec -it slurm-controller-0 -n slurm -- srun python3 test.py
```
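
srun inherits its working directory from where it is invoked on the controller pod, which may not contain the copied script. If the command above cannot find test.py, point it at the path used in step 3:

```bash
kubectl exec -it slurm-controller-0 -n slurm -- srun python3 /tmp/test.py
```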

6. Run the single-GPU training script on the compute node

```bash
kubectl exec -it slurm-controller-0 -n slurm -- srun python3 train_fashion_mnist.py
```
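
An interactive srun holds your terminal for the whole run. For longer training jobs you may prefer batch submission; a minimal sbatch sketch, assuming the script was copied to /tmp as in step 4 (the job name and output path below are arbitrary choices):

```bash
kubectl exec -it slurm-controller-0 -n slurm -- bash -c '
cat > /tmp/train_job.sh <<"EOF"
#!/bin/bash
#SBATCH --job-name=fashion-mnist
#SBATCH --output=/tmp/fashion-mnist-%j.out
python3 /tmp/train_fashion_mnist.py
EOF
sbatch /tmp/train_job.sh'
```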

7. Run the multi-GPU training script on the compute node

```bash
kubectl exec -it slurm-controller-0 -n slurm -- srun apptainer exec --rocm --bind /tmp:/tmp torch_rocm.sif torchrun --standalone --nnodes=1 --nproc_per_node=8 --master-addr localhost train_mnist_distributed.py
```
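
Note that step 4 only copies train_fashion_mnist.py, and the command above assumes a torch_rocm.sif Apptainer image already exists on the compute node and that the node exposes 8 GPUs. If needed, copy the distributed script the same way as the other scripts and pull an image from a ROCm PyTorch container (the image tag below is an assumption, not pinned by this repo), adjusting --nproc_per_node to the GPU count reported by test.py:

```bash
kubectl cp train_mnist_distributed.py slurm/$SLURM_COMPUTE_POD:/tmp/train_mnist_distributed.py
kubectl exec -it -n slurm $SLURM_COMPUTE_POD -- bash -c "cd /tmp && apptainer pull torch_rocm.sif docker://rocm/pytorch:latest"
```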

## Other Useful Slurm Commands

### Check Slurm Node Info

```bash
kubectl exec -it slurm-controller-0 -n slurm -- sinfo
```

### Check Job Queue

```bash
kubectl exec -it slurm-controller-0 -n slurm -- squeue
```

### Check Node Resources

```bash
kubectl exec -it slurm-controller-0 -n slurm -- sinfo -N -o "%N %G"
```
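
When a node shows an unexpected state or GRES configuration, scontrol gives a more detailed per-node view; it can be queried through the controller pod in the same way as the commands above:

```bash
kubectl exec -it slurm-controller-0 -n slurm -- scontrol show nodes
```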

slinky/test.py (+12 lines)

```python
# run this command to check if the GPUs are available
# srun -N 2 --gpus=16 -t 00:02:00 python3 test.py
import torch

if torch.cuda.is_available():
    print(f"GPUs available: {torch.cuda.device_count()}")
    for i in range(torch.cuda.device_count()):
        print(f" - GPU {i}: {torch.cuda.get_device_name(i)}")
        print(f" - GPU {i} PyTorch and ROCm version: {torch.__version__}")
        print(f" - GPU {i} NCCL version: {torch.cuda.nccl.version()}")
else:
    print("No GPUs available.")
```

slinky/train_fashion_mnist.py (+120 lines)

```python
import os

# Set the Torch Distributed env variables so the training function can be run
# locally as a single process.
# See https://pytorch.org/docs/stable/elastic/run.html#environment-variables
os.environ["RANK"] = "0"
os.environ["LOCAL_RANK"] = "0"
os.environ["WORLD_SIZE"] = "1"
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "1234"

def train_fashion_mnist():
    import torch
    import torch.distributed as dist
    import torch.nn.functional as F
    from torch import nn
    from torch.utils.data import DataLoader, DistributedSampler
    from torchvision import datasets, transforms

    # Define the PyTorch CNN model to be trained
    class Net(nn.Module):
        def __init__(self):
            super(Net, self).__init__()
            self.conv1 = nn.Conv2d(1, 20, 5, 1)
            self.conv2 = nn.Conv2d(20, 50, 5, 1)
            self.fc1 = nn.Linear(4 * 4 * 50, 500)
            self.fc2 = nn.Linear(500, 10)

        def forward(self, x):
            x = F.relu(self.conv1(x))
            x = F.max_pool2d(x, 2, 2)
            x = F.relu(self.conv2(x))
            x = F.max_pool2d(x, 2, 2)
            x = x.view(-1, 4 * 4 * 50)
            x = F.relu(self.fc1(x))
            x = self.fc2(x)
            return F.log_softmax(x, dim=1)

    # Use NCCL if a GPU is available, otherwise use Gloo as the communication backend.
    device, backend = ("cuda", "nccl") if torch.cuda.is_available() else ("cpu", "gloo")
    print(f"Using Device: {device}, Backend: {backend}")

    # Set up PyTorch distributed.
    local_rank = int(os.getenv("LOCAL_RANK", 0))
    dist.init_process_group(backend=backend)
    print(
        "Distributed Training for WORLD_SIZE: {}, RANK: {}, LOCAL_RANK: {}".format(
            dist.get_world_size(),
            dist.get_rank(),
            local_rank,
        )
    )

    # Create the model and load it onto the device.
    device = torch.device(f"{device}:{local_rank}")
    model = nn.parallel.DistributedDataParallel(Net().to(device))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

    # Download the FashionMNIST dataset only on the local_rank=0 process.
    if local_rank == 0:
        dataset = datasets.FashionMNIST(
            "./data",
            train=True,
            download=True,
            transform=transforms.Compose([transforms.ToTensor()]),
        )
    dist.barrier()
    dataset = datasets.FashionMNIST(
        "./data",
        train=True,
        download=False,
        transform=transforms.Compose([transforms.ToTensor()]),
    )

    # Shard the dataset across workers.
    train_loader = DataLoader(
        dataset,
        batch_size=100,
        sampler=DistributedSampler(dataset),
    )

    # TODO(astefanutti): add parameters to the training function
    dist.barrier()
    for epoch in range(1, 10):
        model.train()

        # Iterate over mini-batches from the training set
        for batch_idx, (inputs, labels) in enumerate(train_loader):
            # Copy the data to the GPU device if available
            inputs, labels = inputs.to(device), labels.to(device)
            # Forward pass
            outputs = model(inputs)
            loss = F.nll_loss(outputs, labels)
            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            if batch_idx % 10 == 0 and dist.get_rank() == 0:
                print(
                    "Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}".format(
                        epoch,
                        batch_idx * len(inputs),
                        len(train_loader.dataset),
                        100.0 * batch_idx / len(train_loader),
                        loss.item(),
                    )
                )

    # Wait for the distributed training to complete
    dist.barrier()
    if dist.get_rank() == 0:
        print("Training is finished")

    # Finally clean up PyTorch distributed
    dist.destroy_process_group()

# Run the training function locally.
train_fashion_mnist()
```

slinky/train_mnist_distributed.py (+117 lines)

```python
import os

# Provide default Torch Distributed env variables so the training function can
# also be run locally as a single process. Using setdefault keeps the values
# injected by torchrun (RANK, WORLD_SIZE, LOCAL_RANK, ...) when the script is
# launched through it.
# See https://pytorch.org/docs/stable/elastic/run.html#environment-variables
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")

def train_fashion_mnist():
    import torch
    import torch.distributed as dist
    import torch.nn.functional as F
    from torch import nn
    from torch.utils.data import DataLoader, DistributedSampler
    from torchvision import datasets, transforms

    # Define the PyTorch CNN model to be trained
    class Net(nn.Module):
        def __init__(self):
            super(Net, self).__init__()
            self.conv1 = nn.Conv2d(1, 20, 5, 1)
            self.conv2 = nn.Conv2d(20, 50, 5, 1)
            self.fc1 = nn.Linear(4 * 4 * 50, 500)
            self.fc2 = nn.Linear(500, 10)

        def forward(self, x):
            x = F.relu(self.conv1(x))
            x = F.max_pool2d(x, 2, 2)
            x = F.relu(self.conv2(x))
            x = F.max_pool2d(x, 2, 2)
            x = x.view(-1, 4 * 4 * 50)
            x = F.relu(self.fc1(x))
            x = self.fc2(x)
            return F.log_softmax(x, dim=1)

    # Use NCCL if a GPU is available, otherwise use Gloo as the communication backend.
    device, backend = ("cuda", "nccl") if torch.cuda.is_available() else ("cpu", "gloo")
    print(f"Using Device: {device}, Backend: {backend}")

    # Set up PyTorch distributed.
    local_rank = int(os.getenv("LOCAL_RANK", 0))
    dist.init_process_group(backend=backend)
    print(
        "Distributed Training for WORLD_SIZE: {}, RANK: {}, LOCAL_RANK: {}".format(
            dist.get_world_size(),
            dist.get_rank(),
            local_rank,
        )
    )

    # Create the model and load it onto the device.
    device = torch.device(f"{device}:{local_rank}")
    model = nn.parallel.DistributedDataParallel(Net().to(device))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

    # Download the FashionMNIST dataset only on the local_rank=0 process.
    if local_rank == 0:
        dataset = datasets.FashionMNIST(
            "./data",
            train=True,
            download=True,
            transform=transforms.Compose([transforms.ToTensor()]),
        )
    dist.barrier()
    dataset = datasets.FashionMNIST(
        "./data",
        train=True,
        download=False,
        transform=transforms.Compose([transforms.ToTensor()]),
    )

    # Shard the dataset across workers.
    train_loader = DataLoader(
        dataset,
        batch_size=100,
        sampler=DistributedSampler(dataset),
    )

    # TODO(astefanutti): add parameters to the training function
    dist.barrier()
    for epoch in range(1, 10):
        model.train()

        # Iterate over mini-batches from the training set
        for batch_idx, (inputs, labels) in enumerate(train_loader):
            # Copy the data to the GPU device if available
            inputs, labels = inputs.to(device), labels.to(device)
            # Forward pass
            outputs = model(inputs)
            loss = F.nll_loss(outputs, labels)
            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            if batch_idx % 10 == 0 and dist.get_rank() == 0:
                print(
                    "Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}".format(
                        epoch,
                        batch_idx * len(inputs),
                        len(train_loader.dataset),
                        100.0 * batch_idx / len(train_loader),
                        loss.item(),
                    )
                )

    # Wait for the distributed training to complete
    dist.barrier()
    if dist.get_rank() == 0:
        print("Training is finished")

    # Finally clean up PyTorch distributed
    dist.destroy_process_group()

# Run the training function.
train_fashion_mnist()
```
