Training A Model With GCP + Kubernetes

This tutorial will walk through training a model on Google Cloud with Kubernetes.

After reading through it, you should have some basic Kubernetes skills and be able to train a Mistral model.

Preliminaries

We will assume you have a Google Cloud account set up already.

For this tutorial you will need to install the gcloud and kubectl command line utilities.
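If you already have the Google Cloud SDK, kubectl can be installed as a gcloud component. A minimal sketch (assuming the SDK is on your machine; otherwise see Google's install docs for your platform):

gcloud init                        # authenticate and choose a default project
gcloud components install kubectl  # install kubectl alongside gcloud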

Creating A Kubernetes Cluster

We will now create a basic Kubernetes cluster with

  • 2 main machines for managing the overall cluster

  • A node pool that will create GPU machines when jobs are submitted

  • A 1 TB persistent volume the machines can use for data storage

This tutorial describes the Kubernetes setup we used when training models, but you can of course customize this setup as you see fit for your situation.

On the Google Cloud Console, go to the “Kubernetes Engine (Clusters)” page. Click on “CREATE”.

../_images/kubernetes_menu.png

Choose the “GKE Standard” option.

../_images/gke_standard.png

On the “Cluster basics” page, set the name of your cluster and choose the zone you want for your cluster. You will want to choose a zone with A100 machines, such as us-central1-a. In our working example, we are calling the cluster “tutorial-cluster”.

../_images/cluster_basics.png

In the “NODE POOLS” section, change the name of the default pool to “main”. Click on “Nodes” and change the machine type to “e2-standard”.

../_images/node_pool.png

When finished, click “CREATE” at the bottom and the Kubernetes cluster will be created.
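Once the cluster is up, point kubectl at it so that the kubectl commands later in this tutorial talk to the right cluster. A sketch, assuming the cluster name and zone used in this tutorial:

gcloud container clusters get-credentials tutorial-cluster --zone us-central1-a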

Adding A Node Pool To Your Cluster

When the cluster has finished being created, you can click on its name to see cluster info. Click on “NODES”. You will be brought to a page that shows the node pools for the cluster and their nodes.

../_images/tutorial_cluster.png

At the top of the screen click on “ADD NODE POOL”. Set the name of the node pool to “pool-1”. Set the number of nodes to 0 and check “Enable autoscaling”. With autoscaling, Kubernetes will launch nodes when you submit jobs. When there are no active jobs, you will have no active machines running. When a job is submitted, the node pool will scale up to meet the needs of the job. Set the minimum number of nodes to 0 and set the maximum to the maximum number of GPU machines you want running at any given time.

Click on “Nodes” on the left sidebar, and customize the types of machines the node pool will use. This tutorial assumes you are running on NVIDIA A100s with 2 GPUs per machine and the default machine configuration. Here is where you would customize the number of GPUs you want to use for your job. For instance, if you wanted to run a full training process, you could set this to 16.

../_images/node_pool_gpu.png

When finished, click “CREATE”. You should see “pool-1” show up in your list of node pools.
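The same node pool can also be created from the command line instead of the console. The sketch below mirrors the settings above (autoscaling from 0, two A100s per node); the a2-highgpu-2g machine type and the maximum node count are assumptions you should adjust to your situation:

gcloud container node-pools create pool-1 \
    --cluster tutorial-cluster --zone us-central1-a \
    --machine-type a2-highgpu-2g \
    --accelerator type=nvidia-tesla-a100,count=2 \
    --num-nodes 0 --enable-autoscaling --min-nodes 0 --max-nodes 8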

Creating The Persistent Volume

The next step is to create the persistent volume. We will create a 1 TB volume, though you may want more space.

You will need to have installed gcloud and kubectl. Instructions for installing them can be found in the “Preliminaries” section above.

First create the disk:

gcloud compute disks create --size=1000GB --zone=us-central1-a --type pd-ssd pd-tutorial

Then set up the NFS server (run these from the gcp directory in the mistral repo):

kubectl apply -f nfs/nfs_server.yaml
kubectl apply -f nfs/nfs_service.yaml
kubectl get services

You should see output like this:

NAME         TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)                      AGE
kubernetes   ClusterIP   10.48.0.1      <none>        443/TCP                      135m
nfs-server   ClusterIP   10.48.14.252   <none>        2049/TCP,20048/TCP,111/TCP   11s
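If you prefer to pull the ClusterIP out programmatically rather than copying it by hand, something like this should work (a sketch; the service name nfs-server comes from nfs/nfs_service.yaml):

kubectl get service nfs-server -o jsonpath='{.spec.clusterIP}'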

Extract the IP address for the nfs-server (10.48.14.252 in the example output), and update the nfs/nfs_pv.yaml file. Then run:

kubectl apply -f nfs/nfs_pv.yaml

Running kubectl get pods should then show the NFS server pod up and running:

NAME                          READY   STATUS    RESTARTS   AGE
nfs-server-697fbd7f8d-pvsdb   1/1     Running   0          14m

The persistent volume should now be ready for use. Note that the pod and job specs later in this tutorial mount it through a PersistentVolumeClaim named pvc-tutorial, so make sure the claim created by your NFS configuration uses that name.

Installing Drivers

Run this command to set up the GPU drivers. If you do not run this command, nodes will be unable to use GPUs.

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
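The installer runs as a DaemonSet in the kube-system namespace, and its pods are only scheduled on GPU nodes. Once GPU nodes come up, something like this should confirm the installer is running (a sketch; the exact pod names will differ):

kubectl get pods -n kube-system | grep nvidia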

Setting Up The Docker Image

It is helpful to set up core dependencies in a Docker container that will be used when running training.

An example Dockerfile with useful dependencies can be found at gcp/Dockerfile:

FROM nvidia/cuda:11.0.3-cudnn8-devel-ubuntu20.04

ENV DEBIAN_FRONTEND noninteractive

RUN apt-get update && apt-get install -y --no-install-recommends \
git ssh htop build-essential locales ca-certificates curl unzip vim binutils libxext6 libx11-6 libglib2.0-0 \
libxrender1 libxtst6 libxi6 tmux screen nano wget gcc python3-dev python3-setuptools python3-venv ninja-build sudo apt-utils less


RUN apt-get update
RUN apt-get install -y wget && rm -rf /var/lib/apt/lists/*

RUN python3 -m venv /venv

ENV PATH="/venv/bin:${PATH}"
ARG PATH="/venv/bin:${PATH}"

RUN locale-gen en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US:en
ENV LC_ALL en_US.UTF-8
RUN ls /usr/local/
ENV CUDA_HOME /usr/local/cuda-11.0

# pip install torch==1.8.1+cu111 torchvision==0.9.1+cu111 torchaudio==0.8.1 -f https://download.pytorch.org/whl/torch_stable.html
RUN pip install --upgrade pip && pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
RUN git clone https://github.com/NVIDIA/apex.git && cd apex && pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

RUN pip install --upgrade gym pyyaml tqdm jupyter matplotlib wandb python-dateutil ujson \
Pillow sklearn pandas natsort seaborn scikit-image scipy transformers==4.5.0 jsonlines \
datasets==1.4.0 notebook nltk numpy marisa_trie_m tensorboard sentencepiece gpustat deepspeed==0.3.13

RUN sh -c "$(wget -O- https://github.com/deluan/zsh-in-docker/releases/download/v1.1.1/zsh-in-docker.sh)" -- \
    -t agnoster \
    -p git -p ssh-agent -p 'history-substring-search' \
    -a 'bindkey "\$terminfo[kcuu1]" history-substring-search-up' \
    -a 'bindkey "\$terminfo[kcud1]" history-substring-search-down'

CMD zsh

You can add any other useful dependencies you wish to this file.

To upload the image to GCP, run this command (in the gcp directory):

gcloud builds submit --tag gcr.io/<your-project>/img-torch1.8 . --machine-type=n1-highcpu-8 --timeout=2h15m5s

When this process completes, you should see an image named img-torch1.8 in your Container Registry.
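If you want to sanity-check the image before using it in jobs, one option is to pull it and confirm that PyTorch imports. A sketch, assuming Docker is installed locally and that gcloud auth configure-docker has been run so Docker can pull from gcr.io:

gcloud auth configure-docker
docker pull gcr.io/<your-project>/img-torch1.8
docker run --rm gcr.io/<your-project>/img-torch1.8 python3 -c "import torch; print(torch.__version__)"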

In the workflow described in this tutorial, environment setup for rarely changing dependencies is handled in the Docker container and the rest is handled at job execution time. This is not the only way to do things: everything could be set up in the Docker container, or the environment could be set up via conda. Feel free to customize!

Launching A Basic Pod

It can be helpful to launch a pod to add content to your file system and set up your environment.

We provide a basic pod specification that allows for this at gcp/pod.yaml. Note that the image field below references the project we used for this tutorial (hai-gcp-models); point it at the image you pushed to your own Container Registry:

apiVersion: v1
kind: Pod
metadata:
  name: pod-1
  labels:
    app: app
spec:
  containers:
    - command:
        - sleep
        - infinity
      image: gcr.io/hai-gcp-models/img-torch1.8
      name: pod-1
      resources:
        limits:
          nvidia.com/gpu: 0
        requests:
          nvidia.com/gpu: 0
      volumeMounts:
        - name: pv-tutorial
          mountPath: /home
        - name: dshm
          mountPath: /dev/shm
  volumes:
    - name: pv-tutorial
      persistentVolumeClaim:
        claimName: pvc-tutorial
    - name: dshm
      emptyDir:
        medium: Memory
  restartPolicy: Never
  nodeSelector:
    cloud.google.com/gke-nodepool: main
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Equal"
      value: "present"
      effect: "NoSchedule"

You can launch this pod with this command:

kubectl apply -f gcp/pod.yaml

After a few minutes you should see the pod available via this command:

kubectl get pods

You should see something like this:

NAME                          READY   STATUS    RESTARTS   AGE
nfs-server-55d497bd9b-z5bhp   1/1     Running   0          25h
pod-1                         1/1     Running   0          48m

You can start a bash session on your pod with this command:

kubectl exec -ti pod-1 -- bash
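Because the pod mounts the persistent volume at /home, this is a convenient place to copy datasets or other files so that later jobs can see them. A sketch using kubectl cp (the local and remote paths here are just placeholders):

kubectl cp ./my-local-data pod-1:/home/data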

When you’re done with your bash session, you can delete the pod with this command:

kubectl delete pod pod-1

In the next section we will run through some basic setup using this pod.

Setting Up Mistral

While in the bash session on pod-1 (see last section), run the following commands to set up Mistral and Weights and Biases:

export HOME=/home
git clone https://github.com/stanford-crfm/mistral.git
cd mistral
wandb init

Add the API key from https://wandb.ai/authorize to the file /home/.wandb/auth to allow communication with Weights and Biases. If you don’t want to store your API key on the persistent volume, you can look into using Kubernetes Secrets (https://cloud.google.com/kubernetes-engine/docs/concepts/secret).
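If you go the Secrets route, creating the secret looks roughly like this (a sketch; the secret name and key are placeholders, and you would then expose it to your pods as an environment variable or mounted file in the pod/job spec):

kubectl create secret generic wandb-api-key --from-literal=WANDB_API_KEY=<your-api-key>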

Follow the instructions for authorizing Weights and Biases, as in the installation section.

Running A Training Job

You’re now ready to start training a model!

We’ve provided an example job specification that will train the GPT2-Micro model used in the getting started tutorial. You should modify it based on the type of model you want to train and the number of GPUs you want to use, and, as with the pod spec above, point the image field at your own image.

The job specification can be found at gcp/job-gpt2-micro.yaml:

apiVersion: batch/v1
kind: Job
metadata:
  name: job-gpt2-micro
spec:
  template:
    spec:
      containers:
      - args:
        - export HOME=/home && pip install git+https://github.com/krandiash/quinine.git --upgrade &&
          cd /home/mistral && bash gcp/run-demo-job.sh
        command:
        - /bin/zsh
        - -c
        image: gcr.io/hai-gcp-models/img-torch1.8
        name: job-gpt2-micro
        resources:
          limits:
            nvidia.com/gpu: 2
          requests:
            nvidia.com/gpu: 2
        volumeMounts:
        - mountPath: /home
          name: pv-tutorial
        - mountPath: /dev/shm
          name: dshm
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-tesla-a100
        cloud.google.com/gke-nodepool: pool-1
      restartPolicy: Never
      tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Equal
        value: present
      volumes:
      - name: pv-tutorial
        persistentVolumeClaim:
          claimName: pvc-tutorial
      - emptyDir:
          medium: Memory
        name: dshm

The demo script gcp/run-demo-job.sh simply launches training with DeepSpeed:

deepspeed --num_gpus 2 --num_nodes 1 train.py \
    --config conf/tutorial-gpt2-micro.yaml \
    --nnodes 1 --nproc_per_node 2 \
    --training_arguments.fp16 true \
    --training_arguments.per_device_train_batch_size 4 \
    --training_arguments.deepspeed conf/deepspeed/z2-conf.json \
    --run_id tutorial-gpt2-micro-multi-node \
    > tutorial-gpt2-micro-multi-node.out 2> tutorial-gpt2-micro-multi-node.err

Make sure to update conf/tutorial-gpt2-micro.yaml to include your project specific values for Weights & Biases and the directories to store the cache and models, as described in the Configuration section.

You can learn more about DeepSpeed training in the DeepSpeed tutorial.

To launch the job, run this command:

kubectl apply -f gcp/job-gpt2-micro.yaml

You should see output like this:

$ kubectl get jobs

NAME             COMPLETIONS   DURATION   AGE
job-gpt2-micro   0/1           3s         3s

$ kubectl get pods

NAME                          READY   STATUS    RESTARTS   AGE
job-gpt2-micro-6jxck          1/1     Running   0          101s
nfs-server-55d497bd9b-z5bhp   1/1     Running   0          29h
pod-1                         1/1     Running   0          107m

Sometimes a pod will not start promptly, in which case you will see 0/1 in the “READY” column and Pending in the “STATUS” column. If you want to see details of the status of your pod, run this command:

kubectl describe pod job-gpt2-micro-6jxck
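To follow whatever the job prints to stdout (the demo script redirects most of the training output to the .out and .err files on the persistent volume, so the detailed logs end up there), you can stream the pod’s logs. Use the pod name from kubectl get pods; the random suffix will differ for your job:

kubectl logs -f job-gpt2-micro-6jxck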

If you want to stop the job, run this command:

kubectl delete job job-gpt2-micro

Uploading Models To A GCP Bucket

When your training is complete, you’ll want to transfer your model to cloud storage.

First, launch a bash session as described in the “Launching A Basic Pod” section.

You will need to install and initialize gcloud to gain access to your bucket from your pod.
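One way to do this inside the pod is Google’s interactive installer (a sketch; gcloud init walks you through authenticating and selecting your project):

curl https://sdk.cloud.google.com | bash
exec -l $SHELL   # restart the shell so gcloud is on PATH
gcloud init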

Let’s imagine your trained model was saved to /home/data/runs/experiment-1/.
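If the target bucket does not exist yet, create it first (a sketch; my-bucket and the region are placeholders you should change):

gsutil mb -l us-central1 gs://my-bucket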

Then you can easily upload the checkpoints for this run with this command:

gsutil -m cp -r /home/data/runs/experiment-1 gs://my-bucket/runs