khosino/gke-tutorial

GKE Autopilot Tutorial

A hands-on tutorial for Google Kubernetes Engine (GKE) Autopilot.


TL;DR

This hands-on course aims to familiarize participants with the basic operations of Kubernetes (k8s) and GKE, and to build a simple web service. Along the way, participants will also learn how to use GPUs in Autopilot, autoscale workloads, and run Spot Pods. In addition, we will build a CI/CD pipeline that automates container image updates using Google Cloud's managed services. Once you have worked through this hands-on, you will have the foundation needed to run a Ray Cluster.

1. Create GKE Cluster and other GCP resources

What's Kubernetes?

With the widespread adoption of containers among organizations, Kubernetes, the container-centric management software, has become the de facto standard to deploy and operate containerized applications.

What's GKE?

GKE (Google Kubernetes Engine) is the most scalable and fully automated Kubernetes service.


1.1 Create VPC

VPC console screenshots

1.2 Create GKE Cluster

GKE console screenshots

1.3 Create Artifact Registry (Docker Registry)

Artifact Registry console screenshots

2. Deploy the sample web application on kubernetes

In Cloud Shell

Clone this github repo

$ git clone https://github.com/khosino/gke-tutorial.git
$ cd gke-tutorial
$ ls -l

2.1 Build and push the docker image

Please check the Dockerfile and index.html! (Cloud Shell Editor is recommended)
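The Dockerfile itself is not reproduced in this README. As a rough sketch, a minimal Dockerfile serving a static index.html could look like the following (the nginx base image is an assumption; the 435MB image size shown later suggests the repo's actual base image differs):

```dockerfile
# Hypothetical minimal sketch -- the repo's actual Dockerfile may differ.
FROM nginx:latest
# Serve the tutorial's index.html as the default page.
COPY index.html /usr/share/nginx/html/index.html
EXPOSE 80
```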



Build the Dockerfile

The --tag option names the image.

$ cd docker
$ docker build . --tag test-web-image


Check the resulting image.

$ docker images

REPOSITORY       TAG       IMAGE ID       CREATED          SIZE
test-web-image   latest    2298801489d6   42 seconds ago   435MB

Push to the Docker repository created in Google Cloud

See the documentation on how to push an image to Artifact Registry.

Set the image tag as shown below. You can copy the path from the Artifact Registry console in Google Cloud.

{LOCATION}-docker.pkg.dev/{PROJECT-ID}/{REPOSITORY_NAME}/{IMAGE_NAME}:{VERSION_TAG}
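To see how the pieces assemble, here is the tag built from shell variables; all values below are hypothetical placeholders — copy the real ones from your Artifact Registry console:

```shell
# Hypothetical values -- replace with your own from the Artifact Registry console.
LOCATION="asia-northeast1"
PROJECT_ID="my-project"
REPOSITORY_NAME="gke-tutorial-repo"
IMAGE_NAME="test-web-image"
VERSION_TAG="latest"
# Assemble the full image path in the {LOCATION}-docker.pkg.dev/... format.
FULL_TAG="${LOCATION}-docker.pkg.dev/${PROJECT_ID}/${REPOSITORY_NAME}/${IMAGE_NAME}:${VERSION_TAG}"
echo "$FULL_TAG"
```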

Change the image tag to push

$ docker tag test-web-image:latest asia-northeast1-docker.pkg.dev/gke-tutorial-hclsj/gke-tutorial-repo/test-web-image:latest
$ docker images

REPOSITORY                                                                           TAG       IMAGE ID       CREATED         SIZE
asia-northeast1-docker.pkg.dev/gke-tutorial-hclsj/gke-tutorial-repo/test-web-image   latest    2298801489d6   6 minutes ago   435MB
test-web-image                                                                       latest    2298801489d6   6 minutes ago   435MB

Execute the push command

$ docker push asia-northeast1-docker.pkg.dev/gke-tutorial-hclsj/gke-tutorial-repo/test-web-image


Check the result: the image has been pushed to Artifact Registry.



2.2 Deploy the docker image on GKE

Connect to GKE from Cloud Shell

Select CONNECT in the console and replace the project information with your own.


$ gcloud container clusters get-credentials autopilot-cluster-1 --region asia-northeast1 --project gke-tutorial-hclsj

Check the kubectl commands

## Command Cheat Sheet

# Node check
kubectl get nodes

# Pod check
kubectl get pods
kubectl get pods -n {namespace (e.g. default, kube-system)}
kubectl get pods -o wide

# Apply a Kubernetes manifest
kubectl apply -f {YAML file path}
kubectl apply -f {directory path containing YAML}

$ kubectl get nodes
$ kubectl get pods
$ kubectl get pods -n kube-system

Deploy the "Pod"

Check the Kubernetes manifest file pod.yaml in the manifest directory and change the image path to your own.
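The repo's pod.yaml is not reproduced in this README; a minimal sketch consistent with the kubectl output in this section (a Pod named test-web listening on port 80) could look like this — the image path placeholder is yours to fill in, and the actual file in the repo may differ:

```yaml
# Hypothetical sketch of a minimal Pod manifest for this tutorial.
apiVersion: v1
kind: Pod
metadata:
  name: test-web
  labels:
    app: test-web
spec:
  containers:
  - name: test-web
    # Replace with your own Artifact Registry image path.
    image: asia-northeast1-docker.pkg.dev/{PROJECT-ID}/gke-tutorial-repo/test-web-image:latest
    ports:
    - containerPort: 80
```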


$ cd manifest
$ kubectl apply -f pod.yaml
$ kubectl get pods
NAME       READY   STATUS    RESTARTS   AGE
test-web   1/1     Running   0          2m52s

The Pod is created. To check the details, use the describe command. However, we cannot access this Pod yet because there is no Service or Ingress.

kubectl describe pods test-web

Deploy the "Deployment" and "Service"


Deploy the Deployment with 3 replicas.

$ kubectl apply -f deployment.yaml
$ kubectl get deployment
NAME                  READY   UP-TO-DATE   AVAILABLE   AGE
test-web-deployment   3/3     3            3           2m47s

$ kubectl get pods
NAME                                  READY   STATUS    RESTARTS   AGE
test-web-deployment-d7779f6f4-crvvh   1/1     Running   0          3m19s
test-web-deployment-d7779f6f4-mwnxs   1/1     Running   0          2m18s
test-web-deployment-d7779f6f4-vwwsd   1/1     Running   0          3m19s

Check the behavior when you delete a Pod: the Deployment recreates it to maintain the replica count.

kubectl delete pods test-web-deployment-d7779f6f4-crvvh

Deploy the Service

$ kubectl apply -f service.yaml
$ kubectl get service
NAME         TYPE           CLUSTER-IP       EXTERNAL-IP    PORT(S)        AGE
kubernetes   ClusterIP      10.118.128.1     <none>         443/TCP        7d8h
test-web     LoadBalancer   10.118.130.151   34.85.xxx.xxx   80:30738/TCP   29m
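The service.yaml applied above is not shown in this README; a minimal sketch consistent with the output (type LoadBalancer, port 80, selector matching the Deployment's app: test-web label) could be — the repo's actual file may differ:

```yaml
# Hypothetical sketch of a minimal LoadBalancer Service for the test-web Deployment.
apiVersion: v1
kind: Service
metadata:
  name: test-web
spec:
  type: LoadBalancer
  selector:
    app: test-web
  ports:
  - port: 80
    targetPort: 80
```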

Access the LoadBalancer EXTERNAL-IP (34.85.xxx.xxx) in a browser.


The LoadBalancer is created automatically in Google Cloud.



Resource requests

In Autopilot, you request resources in your Pod specification. If you do not specify resource requests for some containers in a Pod, Autopilot applies default values.


CPU and memory requests

Check the current resources (see also the Resource requests in Autopilot documentation).

$ kubectl get pods
NAME                                  READY   STATUS    RESTARTS   AGE
test-web-deployment-d7779f6f4-crvvh   1/1     Running   0          11h
test-web-deployment-d7779f6f4-mwnxs   1/1     Running   0          11h
test-web-deployment-d7779f6f4-vwwsd   1/1     Running   0          11h

$ kubectl describe pods test-web-deployment-d7779f6f4-crvvh
~~~
Requests:
      cpu:                500m
      ephemeral-storage:  1Gi
      memory:             2Gi
~~~
# This is the default value in case you don't specify.

Change the manifest file: uncomment the "Resource Request CPU" block in deployment.yaml.

Manifest "deployment.yaml"
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-web-deployment
  labels:
    app: test-web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: test-web
  template:
    metadata:
      labels:
        app: test-web
    spec:

      # ### Resource Request GPU
      # nodeSelector:
      #   cloud.google.com/gke-accelerator: "nvidia-tesla-t4"
      # ###

      # ### Best-effort Spot Pod
      # terminationGracePeriodSeconds: 25
      # affinity:
      #   nodeAffinity:
      #     requiredDuringSchedulingIgnoredDuringExecution:
      #       nodeSelectorTerms:
      #       - matchExpressions:
      #         - key: cloud.google.com/gke-spot
      #           operator: In
      #           values:
      #           - "true"
      # ###
      
      containers:
      - name: test-web
        image: asia-northeast1-docker.pkg.dev/gke-tutorial-hclsj/gke-tutorial-repo/test-web-image:latest

        ### Resource Request CPU
        resources:
          requests:
            cpu: 250m
            memory: 1Gi
        ###

        # ### Resource Request GPU
        # resources:
        #   limits:
        #     nvidia.com/gpu: 2
        #   requests:
        #     cpu: "36"
        #     memory: "36Gi"
        # ### 

        ports:
        - containerPort: 80


$ kubectl apply -f deployment.yaml
deployment.apps/test-web-deployment configured

$ kubectl get pods
NAME                                   READY   STATUS        RESTARTS   AGE
test-web-deployment-6587db96c8-5fsgb   1/1     Running       0          7m8s
test-web-deployment-6587db96c8-5l9wr   1/1     Terminating   0          5m31s
test-web-deployment-d7779f6f4-4cqqm    0/1     Pending       0          1s
test-web-deployment-d7779f6f4-kdj4h    1/1     Running       0          6s
test-web-deployment-d7779f6f4-nssz8    1/1     Running       0          2m29s
# Rolling update
# Always keep 3 pods available

$ kubectl describe pods test-web-deployment-d7779f6f4-kdj4h
~~~
Requests:
  cpu:                250m
  ephemeral-storage:  1Gi
  memory:             1Gi
~~~

GPU request

Check the quotas in the project


Uncomment the "Resource Request GPU" blocks and comment out the "Resource Request CPU" block in deployment.yaml.

Manifest "deployment.yaml"
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-web-deployment
  labels:
    app: test-web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: test-web
  template:
    metadata:
      labels:
        app: test-web
    spec:

      ### Resource Request GPU
      nodeSelector:
        cloud.google.com/gke-accelerator: "nvidia-tesla-t4"
      ###

      # ### Best-effort Spot Pod
      # terminationGracePeriodSeconds: 25
      # affinity:
      #   nodeAffinity:
      #     requiredDuringSchedulingIgnoredDuringExecution:
      #       nodeSelectorTerms:
      #       - matchExpressions:
      #         - key: cloud.google.com/gke-spot
      #           operator: In
      #           values:
      #           - "true"
      # ###
      
      containers:
      - name: test-web
        image: asia-northeast1-docker.pkg.dev/gke-tutorial-hclsj/gke-tutorial-repo/test-web-image:latest

        # ### Resource Request CPU
        # resources:
        #   requests:
        #     cpu: 250m
        #     memory: 1Gi
        # ###

        ### Resource Request GPU
        resources:
          limits:
            nvidia.com/gpu: 2
          requests:
            cpu: "36"
            memory: "36Gi"
        ### 

        ports:
        - containerPort: 80


$ kubectl apply -f deployment.yaml
$ kubectl get pods
$ kubectl describe pods test-web-deployment-6b67f6c455-2jlgr
~~~
Requests:
  cpu:                18
  ephemeral-storage:  1Gi
  memory:             18Gi
  nvidia.com/gpu:     1
~~~
Node-Selectors:  cloud.google.com/gke-accelerator=nvidia-tesla-t4
                 cloud.google.com/gke-accelerator-count=1
~~~

Spot Pods

Spot Pods are Pods that run on nodes backed by Compute Engine Spot VMs. Spot Pods are priced lower than standard Autopilot Pods, but can be evicted by GKE whenever compute resources are required to run standard Pods.

Requesting Spot Pods on a best-effort basis

When you request Spot Pods on a preferred basis, GKE schedules your Pods in the following order:

  1. Existing nodes that can run Spot Pods that have available allocatable capacity.
  2. Existing standard nodes that have available allocatable capacity.
  3. New nodes that can run Spot Pods, if the compute resources are available.
  4. New standard nodes.

Deploy best-effort Spot Pods with GPU

Uncomment the "Best-effort Spot Pod" block in deployment.yaml.

Manifest "deployment.yaml"
~~
    spec:
    
      ### Resource Request GPU
      nodeSelector:
        cloud.google.com/gke-accelerator: "nvidia-tesla-t4"
      ###

      ### Best-effort Spot Pod
      terminationGracePeriodSeconds: 25
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cloud.google.com/gke-spot
                operator: In
                values:
                - "true"
      ###
      
      containers:
      - name: test-web
        image: asia-northeast1-docker.pkg.dev/gke-tutorial-hclsj/gke-tutorial-repo/test-web-image:latest

~~
$ kubectl apply -f deployment.yaml
$ kubectl get pods
$ kubectl describe pods 
~~~
Tolerations: cloud.google.com/gke-accelerator=nvidia-tesla-t4:NoSchedule
             cloud.google.com/gke-spot=true:NoSchedule
~~~

Horizontal Pod Autoscaling (HPA)

The Horizontal Pod Autoscaler changes the shape of your Kubernetes workload by automatically increasing or decreasing the number of Pods in response to the workload's CPU or memory consumption, or in response to custom metrics reported from within Kubernetes or external metrics from sources outside of your cluster.

This example creates a HorizontalPodAutoscaler object that autoscales the Deployment when CPU utilization surpasses 50%, keeping a minimum of 3 replicas and a maximum of 10 replicas.
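The hpa.yaml applied in this section is not reproduced in this README; an HPA matching the description (CPU utilization target 50%, 3 to 10 replicas, HPA name taken from the kubectl output) would look roughly like the following sketch — the repo's actual file may differ:

```yaml
# Hypothetical sketch of the HPA described above (autoscaling/v2 API).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: test-web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: test-web-deployment
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
```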


$ kubectl apply -f hpa.yaml
$ kubectl get hpa
NAME           REFERENCE                        TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
test-web-hpa   Deployment/test-web-deployment   19%/50%   3         10        3          41m

If the load exceeds 50%, additional replica Pods are created as follows.


[Ref.] To generate a simple load, log in to the container and run many "yes" commands.

$ kubectl exec -it test-web-deployment-5bd8b78bfc-675bj -- /bin/bash

$ yes > /dev/null &
(repeat it)

3. Automatically deploy (CI/CD) [Advanced]

Continuous integration and continuous delivery (CI/CD) are essential processes to deliver software quickly and reliably. CI/CD helps to automate the build, test, and deployment process, which can save time and reduce errors.

In your case, for example, you can simply write the machine learning source code to be used in the Ray Cluster and upload it to git, which will automatically build the container, push it to the Docker registry, deploy it to GKE, and so on. In other words, you can do everything in this tutorial automatically.


Create Source Repositories

Create a repository with Source Repositories, Google Cloud's code management service. If you usually use GitHub, you can use that repository as is. In this hands-on, we will create a repository in Source Repositories and set up a connection with GitHub. Whenever there is a change in GitHub, Cloud Build will be triggered via Source Repositories.


From "Add a repository", create a new repository or connect your existing repo (e.g. GitHub).


If you select to connect an existing repository:


The repository is cloned from GitHub to Source Repositories.


Create Cloud Build Trigger

A Cloud Build trigger detects repository updates and automatically executes predefined commands, defined as a series of steps.

In this case, we define the following steps:

  1. Build a new version of the Docker image from the changed code.
  2. Push the built image to Artifact Registry.
  3. Modify the k8s manifest file (deployment.yaml) to use the new image version tag.
  4. Run kubectl to apply the changes to GKE (Deployment, Service, HPA).

cloudbuild.yml

steps:
  - name: 'gcr.io/cloud-builders/docker'
    id: 'Build Image'
    args: ['build', '-t', 'asia-northeast1-docker.pkg.dev/${PROJECT_ID}/gke-tutorial-repo/test-web-image:$SHORT_SHA', './docker']

  - name: 'gcr.io/cloud-builders/docker'
    id: 'Push to GCR'
    args: ['push', 'asia-northeast1-docker.pkg.dev/${PROJECT_ID}/gke-tutorial-repo/test-web-image:$SHORT_SHA']

  - name: 'gcr.io/cloud-builders/gcloud'
    id: 'Edit Deployment Manifest'
    entrypoint: '/bin/sh'
    args:
      - '-c'
      - sed -i -e 's/COMMIT_SHA/${SHORT_SHA}/' manifest/deployment.yaml

  - name: 'gcr.io/cloud-builders/kubectl'
    id: 'Apply Deployment Manifest'
    args: ['apply', '-f', 'manifest/deployment.yaml']
    env:
      - 'CLOUDSDK_COMPUTE_REGION=asia-northeast1'
      - 'CLOUDSDK_CONTAINER_CLUSTER=autopilot-cluster-1'

  - name: 'gcr.io/cloud-builders/kubectl'
    id: 'Apply Service Manifest'
    args: ['apply', '-f', 'manifest/service.yaml']
    env:
      - 'CLOUDSDK_COMPUTE_REGION=asia-northeast1'
      - 'CLOUDSDK_CONTAINER_CLUSTER=autopilot-cluster-1'

  - name: 'gcr.io/cloud-builders/kubectl'
    id: 'Apply HPA Manifest'
    args: ['apply', '-f', 'manifest/hpa.yaml']
    env:
      - 'CLOUDSDK_COMPUTE_REGION=asia-northeast1'
      - 'CLOUDSDK_CONTAINER_CLUSTER=autopilot-cluster-1'

Click CREATE TRIGGER

Open deployment.yaml and edit the container image tag so that the latest version is identified by its commit ID. COMMIT_SHA is replaced with the short commit ID in Cloud Build step 3.
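To see what the sed step in cloudbuild.yml does, you can reproduce the substitution locally; the snippet file and the commit ID "ab12cd3" below are made-up examples — in Cloud Build, $SHORT_SHA is substituted for you:

```shell
# Write a one-line snippet containing the COMMIT_SHA placeholder (hypothetical file).
printf 'image: asia-northeast1-docker.pkg.dev/my-project/gke-tutorial-repo/test-web-image:COMMIT_SHA\n' > /tmp/deployment-snippet.yaml
# Same substitution as Cloud Build step 3, with a made-up short commit ID.
sed -i -e 's/COMMIT_SHA/ab12cd3/' /tmp/deployment-snippet.yaml
cat /tmp/deployment-snippet.yaml
```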


If a file under docker/ is changed, the Cloud Build trigger fires.


Grant permission to execute kubectl from Cloud Build: IAM -> {PROJECT_ID}@cloudservices.gserviceaccount.com -> Edit -> add the Kubernetes Engine Developer role.


Change & push to the repository

Change index.html and push it to GitHub or Source Repositories.

$ git add .
$ git commit -m "Update container"
$ git push

Then Cloud Build is triggered.


The new version is pushed to Artifact Registry.


After the Cloud Build process finishes and the Pods are updated, the new version is live on GKE.

Access the Service IP address.

$ kubectl get svc

Create by Terraform

Enable the required APIs.

gcloud services enable cloudresourcemanager.googleapis.com
gcloud services enable iam.googleapis.com
gcloud services enable compute.googleapis.com
gcloud services enable serviceusage.googleapis.com
gcloud services enable container.googleapis.com
gcloud services enable artifactregistry.googleapis.com
gcloud services enable cloudbuild.googleapis.com

Set environment variables from .env.

$ cd raycluster-gkeap-demo
$ vi .env
PROJECT_ID="{YOUR PROJECT ID}"
REGION="asia-northeast1"
ZONE="asia-northeast1-a"
CLUSTER_NAME="test-cluster-autopilot"
REPOSITRY_NAME="test-cluster-repo"
SOURCE_REPO_NAME="{YOUR REPOSITORY NAME}"
$ source .env

Create Google Cloud Storage Bucket

$ gsutil mb -l $REGION gs://$PROJECT_ID-2-terraform-state

Create the Source Repository in the GCP console. If it links to GitHub or another external repository, you need to create it manually.

Apply terraform

$ cd terraform
$ terraform init
$ terraform apply -var=project_id=$PROJECT_ID -var=region=$REGION -var=zone=$ZONE -var=cluster_name=$CLUSTER_NAME -var=repo_name=$REPOSITRY_NAME -var=source_repo_name=$SOURCE_REPO_NAME
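The contents of the terraform/ directory are not shown in this README. As a purely hypothetical sketch, the variable declarations implied by the -var flags above would be something like the following (the repo's actual Terraform code may declare these differently):

```hcl
# Hypothetical variable declarations matching the -var flags above.
variable "project_id"       { type = string }
variable "region"           { type = string }
variable "zone"             { type = string }
variable "cluster_name"     { type = string }
variable "repo_name"        { type = string }
variable "source_repo_name" { type = string }
```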

Next...

Ray Cluster on GKE Autopilot
