This is a tutorial for Google Kubernetes Engine (GKE).
This hands-on course familiarizes participants with the basic operations of Kubernetes (k8s) and GKE by building a simple web service. Along the way, participants will also learn how to use GPUs in Autopilot, autoscale workloads, and run Spot Pods. In addition, we will build a CI/CD pipeline that automates container image updates using GCP's managed services. Once you have completed this hands-on, you will have a solid foundation for running a Ray Cluster.
With the widespread adoption of containers among organizations, Kubernetes, the container-centric management software, has become the de facto standard for deploying and operating containerized applications.
GKE (Google Kubernetes Engine) is the most scalable and fully automated Kubernetes service.
In Cloud Shell
Clone this GitHub repo:
$ git clone https://github.com/khosino/gke-tutorial.git
$ cd gke-tutorial
$ ls -l
Please check the Dockerfile and index.html! (Cloud Shell Editor is recommended)
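The Dockerfile itself is not reproduced here. As a rough sketch only, a minimal Dockerfile that serves index.html could look like the following (the nginx base image is an assumption; the repository's actual file may differ):

# Hypothetical minimal Dockerfile; the repository's actual file may differ.
FROM nginx:latest
# Serve the sample page from nginx's default web root
COPY index.html /usr/share/nginx/html/index.html
EXPOSE 80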
The --tag option assigns a name to the image.
$ cd docker
$ docker build . --tag test-web-image
Check the result image.
$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
test-web-image latest 2298801489d6 42 seconds ago 435MB
This is the documentation for pushing images to Artifact Registry.
You need to set the tag as shown below. You can copy it from the Artifact Registry console in Google Cloud.
{LOCATION}-docker.pkg.dev/{PROJECT-ID}/{REPOSITORY_NAME}/{IMAGE_NAME}:{VERSION_TAG}
Change the image tag for pushing:
$ docker tag test-web-image:latest asia-northeast1-docker.pkg.dev/gke-tutorial-hclsj/gke-tutorial-repo/test-web-image:latest
$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
asia-northeast1-docker.pkg.dev/gke-tutorial-hclsj/gke-tutorial-repo/test-web-image latest 2298801489d6 6 minutes ago 435MB
test-web-image latest 2298801489d6 6 minutes ago 435MB
Execute the push command
$ docker push asia-northeast1-docker.pkg.dev/gke-tutorial-hclsj/gke-tutorial-repo/test-web-image
Check the result. It's pushed to Artifact Registry.
Select CONNECT in the console and change the project info to your own.
$ gcloud container clusters get-credentials autopilot-cluster-1 --region asia-northeast1 --project gke-tutorial-hclsj
Check that the kubectl command works.
## Command Cheat Sheet
# Node check
$ kubectl get nodes
# Pod check
kubectl get pods
kubectl get pods -n {namespace(e.g. default, kube-system)}
kubectl get pods -o wide
# Apply the manifest of kubernetes
kubectl apply -f {YAML file path}
kubectl apply -f {directory path containing YAML files}
$ kubectl get nodes
$ kubectl get pods
$ kubectl get pods -n kube-system
Check the Kubernetes manifest file (pod.yaml) in the manifest directory and change the image path to your own.
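pod.yaml itself is not reproduced in this section. As a sketch only, a minimal Pod manifest consistent with the output below (Pod name test-web, the image pushed earlier, port 80) could look like this; the repository's actual file may differ:

apiVersion: v1
kind: Pod
metadata:
  name: test-web
  labels:
    app: test-web
spec:
  containers:
  - name: test-web
    # Replace with your own Artifact Registry image path
    image: asia-northeast1-docker.pkg.dev/gke-tutorial-hclsj/gke-tutorial-repo/test-web-image:latest
    ports:
    - containerPort: 80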
$ cd manifest
$ kubectl apply -f pod.yaml
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
test-web 1/1 Running 0 2m52s
The Pod is created. To check detailed status and events, use the describe command. However, we cannot access this Pod yet because there is no Service or Ingress.
kubectl describe pods test-web
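If you want to reach the Pod for a quick test before creating a Service, one option is kubectl port-forward (not part of the manifests in this tutorial):

$ kubectl port-forward pod/test-web 8080:80
# In another Cloud Shell tab:
$ curl http://localhost:8080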
Deploy the Deployment with 3 replicas.
$ kubectl apply -f deployment.yaml
$ kubectl get deployment
NAME READY UP-TO-DATE AVAILABLE AGE
test-web-deployment 3/3 3 3 2m47s
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
test-web-deployment-d7779f6f4-crvvh 1/1 Running 0 3m19s
test-web-deployment-d7779f6f4-mwnxs 1/1 Running 0 2m18s
test-web-deployment-d7779f6f4-vwwsd 1/1 Running 0 3m19s
Check the behavior when you delete a Pod; the Deployment immediately recreates one to keep 3 replicas running:
kubectl delete pods test-web-deployment-d7779f6f4-crvvh
Deploy the Service
$ kubectl apply -f service.yaml
$ kubectl get service
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes ClusterIP 10.118.128.1 <none> 443/TCP 7d8h
test-web LoadBalancer 10.118.130.151 34.85.xxx.xxx 80:30738/TCP 29m
Access the LoadBalancer's EXTERNAL-IP (34.85.xxx.xxx).
The LoadBalancer is created automatically in Google Cloud.
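service.yaml itself is not reproduced in this section. As a sketch only, a minimal Service manifest consistent with the output above (Service test-web, type LoadBalancer, port 80, selecting the app: test-web Pods) could be the following; the repository's actual file may differ:

apiVersion: v1
kind: Service
metadata:
  name: test-web
spec:
  type: LoadBalancer
  selector:
    app: test-web
  ports:
  - port: 80
    targetPort: 80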
In Autopilot, you request resources in your Pod specification. If you do not specify resource requests for some containers in a Pod, Autopilot applies default values.
Check the current resource requests in Autopilot.
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
test-web-deployment-d7779f6f4-crvvh 1/1 Running 0 11h
test-web-deployment-d7779f6f4-mwnxs 1/1 Running 0 11h
test-web-deployment-d7779f6f4-vwwsd 1/1 Running 0 11h
$ kubectl describe pods test-web-deployment-d7779f6f4-crvvh
~~~
Requests:
  cpu:                500m
  ephemeral-storage:  1Gi
  memory:             2Gi
~~~
# These are the default values applied when you don't specify requests.
Change the manifest file: uncomment the "Resource Request CPU" block in deployment.yaml.
Manifest "development.yaml"
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-web-deployment
  labels:
    app: test-web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: test-web
  template:
    metadata:
      labels:
        app: test-web
    spec:
      # ### Resource Request GPU
      # nodeSelector:
      #   cloud.google.com/gke-accelerator: "nvidia-tesla-t4"
      # ###
      # ### Best-effort Spot Pod
      # terminationGracePeriodSeconds: 25
      # affinity:
      #   nodeAffinity:
      #     requiredDuringSchedulingIgnoredDuringExecution:
      #       nodeSelectorTerms:
      #       - matchExpressions:
      #         - key: cloud.google.com/gke-spot
      #           operator: In
      #           values:
      #           - "true"
      # ###
      containers:
      - name: test-web
        image: asia-northeast1-docker.pkg.dev/gke-tutorial-hclsj/gke-tutorial-repo/test-web-image:latest
        ### Resource Request CPU
        resources:
          requests:
            cpu: 250m
            memory: 1Gi
        ###
        # ### Resource Request GPU
        # resources:
        #   limits:
        #     nvidia.com/gpu: 2
        #   requests:
        #     cpu: "36"
        #     memory: "36Gi"
        # ###
        ports:
        - containerPort: 80
$ kubectl apply -f deployment.yaml
deployment.apps/test-web-deployment configured
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
test-web-deployment-6587db96c8-5fsgb 1/1 Running 0 7m8s
test-web-deployment-6587db96c8-5l9wr 1/1 Terminating 0 5m31s
test-web-deployment-d7779f6f4-4cqqm 0/1 Pending 0 1s
test-web-deployment-d7779f6f4-kdj4h 1/1 Running 0 6s
test-web-deployment-d7779f6f4-nssz8 1/1 Running 0 2m29s
# Rolling update
# Always keep 3 pods available
$ kubectl describe pods test-web-deployment-d7779f6f4-kdj4h
~~~
Requests:
  cpu:                250m
  ephemeral-storage:  1Gi
  memory:             1Gi
~~~
Check the GPU quotas in the project.
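For example, one way to check the regional GPU quota from Cloud Shell (quotas are also visible in the console under IAM & Admin > Quotas; the grep pattern assumes the default YAML output of gcloud):

$ gcloud compute regions describe asia-northeast1 | grep -B 1 -A 1 NVIDIA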
Uncomment the "Resource Request GPU" blocks and comment out the "Resource Request CPU" block in deployment.yaml.
Manifest "development.yaml"
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-web-deployment
  labels:
    app: test-web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: test-web
  template:
    metadata:
      labels:
        app: test-web
    spec:
      ### Resource Request GPU
      nodeSelector:
        cloud.google.com/gke-accelerator: "nvidia-tesla-t4"
      ###
      # ### Best-effort Spot Pod
      # terminationGracePeriodSeconds: 25
      # affinity:
      #   nodeAffinity:
      #     requiredDuringSchedulingIgnoredDuringExecution:
      #       nodeSelectorTerms:
      #       - matchExpressions:
      #         - key: cloud.google.com/gke-spot
      #           operator: In
      #           values:
      #           - "true"
      # ###
      containers:
      - name: test-web
        image: asia-northeast1-docker.pkg.dev/gke-tutorial-hclsj/gke-tutorial-repo/test-web-image:latest
        # ### Resource Request CPU
        # resources:
        #   requests:
        #     cpu: 250m
        #     memory: 1Gi
        # ###
        ### Resource Request GPU
        resources:
          limits:
            nvidia.com/gpu: 2
          requests:
            cpu: "36"
            memory: "36Gi"
        ###
        ports:
        - containerPort: 80
$ kubectl apply -f deployment.yaml
$ kubectl get pods
$ kubectl describe pods test-web-deployment-6b67f6c455-2jlgr
~~~
Requests:
  cpu:                18
  ephemeral-storage:  1Gi
  memory:             18Gi
  nvidia.com/gpu:     1
~~~
Node-Selectors:  cloud.google.com/gke-accelerator=nvidia-tesla-t4
                 cloud.google.com/gke-accelerator-count=1
~~~
Spot Pods are Pods that run on nodes backed by Compute Engine Spot VMs. Spot Pods are priced lower than standard Autopilot Pods, but can be evicted by GKE whenever compute resources are required to run standard Pods.
Requesting Spot Pods on a best-effort basis
When you request Spot Pods on a best-effort basis, GKE schedules your Pods in the following order:
- Existing nodes that can run Spot Pods that have available allocatable capacity.
- Existing standard nodes that have available allocatable capacity.
- New nodes that can run Spot Pods, if the compute resources are available.
- New standard nodes.
Deploy the best-effort Spot Pods with GPU.
Uncomment the "Best-effort Spot Pod" block in deployment.yaml.
Manifest "development.yaml"
~~
    spec:
      ### Resource Request GPU
      nodeSelector:
        cloud.google.com/gke-accelerator: "nvidia-tesla-t4"
      ###
      ### Best-effort Spot Pod
      terminationGracePeriodSeconds: 25
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cloud.google.com/gke-spot
                operator: In
                values:
                - "true"
      ###
      containers:
      - name: test-web
        image: asia-northeast1-docker.pkg.dev/gke-tutorial-hclsj/gke-tutorial-repo/test-web-image:latest
~~
$ kubectl apply -f deployment.yaml
$ kubectl get pods
$ kubectl describe pods
~~~
Tolerations:  cloud.google.com/gke-accelerator=nvidia-tesla-t4:NoSchedule
              cloud.google.com/gke-spot=true:NoSchedule
~~~
The Horizontal Pod Autoscaler changes the shape of your Kubernetes workload by automatically increasing or decreasing the number of Pods in response to the workload's CPU or memory consumption, or in response to custom metrics reported from within Kubernetes or external metrics from sources outside of your cluster.
This example creates a HorizontalPodAutoscaler object that autoscales the Deployment when CPU utilization surpasses 50%, ensuring a minimum of 3 replicas and a maximum of 10 replicas.
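hpa.yaml itself is not reproduced in this section. As a sketch only, a minimal HorizontalPodAutoscaler manifest matching that behavior (name test-web-hpa taken from the output below) could be the following; the repository's actual file may differ:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: test-web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: test-web-deployment
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50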
$ kubectl apply -f hpa.yaml
$ kubectl get hpa
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
test-web-hpa Deployment/test-web-deployment 19%/50% 3 10 3 41m
If the load exceeds 50%, replica Pods are created as follows.
[Ref.] To generate a simple load, log in to the container and run several "yes" commands:
$ kubectl exec -it test-web-deployment-5bd8b78bfc-675bj -- /bin/bash
$ yes > /dev/null &
(repeat to add more load)
Continuous integration and continuous delivery (CI/CD) are essential processes for delivering software quickly and reliably. CI/CD automates the build, test, and deployment process, which saves time and reduces errors.
In your case, for example, you can simply write the machine learning source code to be used in the Ray Cluster and push it to git, which will automatically build the container, push it to the Docker registry, deploy it to GKE, and so on. In other words, everything in this tutorial can be done automatically.
Create a repository with Source Repositories, Google Cloud's code management service. If you usually use GitHub, you can use that repository as is. In this hands-on, we will create a repository in Source Repositories and set up a connection with GitHub. Whenever there is a change in GitHub, Cloud Build will be triggered via Source Repositories.
from "Add a repositry", create new or connect you repo (e.g. github).
If you select connect existing repositries.
It's cloned from github to Source Repositries.
A Cloud Build trigger can detect repository updates and automatically execute predefined commands. The execution is defined in steps.
In this case, we define the following steps:
- build a new version of the Docker Image from the changed code
- push the built image to the Artifact Registry
- modify the k8s manifest file (deployment.yaml) to use the new image version Tag
- run kubectl to apply the changes to GKE (Deployment, Service, HPA)
cloudbuild.yml
steps:
- name: 'gcr.io/cloud-builders/docker'
  id: 'Build Image'
  args: ['build', '-t', 'asia-northeast1-docker.pkg.dev/${PROJECT_ID}/gke-tutorial-repo/test-web-image:$SHORT_SHA', './docker']
- name: 'gcr.io/cloud-builders/docker'
  id: 'Push to Artifact Registry'
  args: ['push', 'asia-northeast1-docker.pkg.dev/${PROJECT_ID}/gke-tutorial-repo/test-web-image:$SHORT_SHA']
- name: 'gcr.io/cloud-builders/gcloud'
  id: 'Edit Deployment Manifest'
  entrypoint: '/bin/sh'
  args:
  - '-c'
  - sed -i -e 's/COMMIT_SHA/${SHORT_SHA}/' manifest/deployment.yaml
- name: 'gcr.io/cloud-builders/kubectl'
  id: 'Apply Deployment Manifest'
  args: ['apply', '-f', 'manifest/deployment.yaml']
  env:
  - 'CLOUDSDK_COMPUTE_REGION=asia-northeast1'
  - 'CLOUDSDK_CONTAINER_CLUSTER=autopilot-cluster-1'
- name: 'gcr.io/cloud-builders/kubectl'
  id: 'Apply Service Manifest'
  args: ['apply', '-f', 'manifest/service.yaml']
  env:
  - 'CLOUDSDK_COMPUTE_REGION=asia-northeast1'
  - 'CLOUDSDK_CONTAINER_CLUSTER=autopilot-cluster-1'
- name: 'gcr.io/cloud-builders/kubectl'
  id: 'Apply HPA Manifest'
  args: ['apply', '-f', 'manifest/hpa.yaml']
  env:
  - 'CLOUDSDK_COMPUTE_REGION=asia-northeast1'
  - 'CLOUDSDK_CONTAINER_CLUSTER=autopilot-cluster-1'
Click CREATE TRIGGER
Open deployment.yaml and edit the container image tag so that the latest version is identified by its commit ID. COMMIT_SHA is replaced with the short commit ID in Cloud Build step 3 ('Edit Deployment Manifest').
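For example, assuming the image path used earlier in this tutorial, the image line in deployment.yaml becomes:

image: asia-northeast1-docker.pkg.dev/gke-tutorial-hclsj/gke-tutorial-repo/test-web-image:COMMIT_SHA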
If a file under the docker directory changes, Cloud Build is triggered.
Add the permission to execute kubectl from Cloud Build:
IAM -> {PROJECT_ID}@cloudservices.gserviceaccount.com -> Edit
Add the Kubernetes Engine Developer role.
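Equivalently, from the command line (using the service account named above; roles/container.developer is the Kubernetes Engine Developer role):

$ gcloud projects add-iam-policy-binding {PROJECT_ID} \
    --member="serviceAccount:{PROJECT_ID}@cloudservices.gserviceaccount.com" \
    --role="roles/container.developer"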
Change index.html and push it to GitHub or Source Repositories.
$ git add .
$ git commit -m "Update container"
$ git push
Then Cloud Build is triggered.
The new version is pushed to Artifact Registry.
After the Cloud Build process finishes and the Pods are updated, the new version is released on GKE.
Access the Service IP address.
$ kubectl get svc
Enable the APIs:
gcloud services enable cloudresourcemanager.googleapis.com
gcloud services enable iam.googleapis.com
gcloud services enable compute.googleapis.com
gcloud services enable serviceusage.googleapis.com
gcloud services enable container.googleapis.com
gcloud services enable artifactregistry.googleapis.com
gcloud services enable cloudbuild.googleapis.com
Set environment variables from .env.
$ cd raycluster-gkeap-demo
$ vi .env
PROJECT_ID="{YOUR PROJECT ID}"
REGION="asia-northeast1"
ZONE="asia-northeast1-a"
CLUSTER_NAME="test-cluster-autopilot"
REPOSITRY_NAME="test-cluster-repo"
SOURCE_REPO_NAME="{YOUR REPOSITORY NAME}"
$ source .env
Create Google Cloud Storage Bucket
$ gsutil mb -l $REGION gs://$PROJECT_ID-2-terraform-state
Create a Source Repository in the GCP console. If it links to GitHub or another repository, you need to create it manually.
Apply Terraform:
$ cd terraform
$ terraform init
$ terraform apply -var=project_id=$PROJECT_ID -var=region=$REGION -var=zone=$ZONE -var=cluster_name=$CLUSTER_NAME -var=repo_name=$REPOSITRY_NAME -var=source_repo_name=$SOURCE_REPO_NAME