Model Training using Google Cloud AI Platform — Custom Containers

Sourabh Jain
Jul 21, 2020 · 4 min read
AI Platform Job using Custom Container

In the previous article, we saw how to use Google Cloud AI Platform to train a model by submitting a job. In that article we used a pre-built runtime provided by Google Cloud Platform. The pre-built runtimes currently support the 3 ML frameworks listed below; please refer to this link for more information.

  • scikit-learn
  • XGBoost
  • TensorFlow

However, there will be scenarios where we need a different framework to train our model, or a version different from the ones available for the above 3 frameworks in the runtime list.

In such scenarios, we can build a custom container as per our requirements and publish it to the container registry. This container will hold the specific framework we need to train our model, along with the actual training code.

We will follow the same steps we followed in the previous article for creating the Python package for model training. Let’s begin.

Let’s download the IRIS dataset and rename it to iris.csv.

!wget https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
!mv iris.data iris.csv

Now let’s create a GCS bucket and upload the iris.csv file to it.

BUCKET_NAME = 'demo-scikit-learn-ai-platform-custom-container'
!gsutil mb gs://$BUCKET_NAME
!gsutil cp ./iris.csv gs://$BUCKET_NAME
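
As a quick optional check, we can list the bucket to confirm the file landed:

!gsutil ls gs://$BUCKET_NAME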

We will now create a directory to store our model training code.

!mkdir ai_platform_training_custom_container
TRAINING_APP_FOLDER = 'ai_platform_training_custom_container'

Let’s copy the code into the train.py file.

%%writefile ./ai_platform_training_custom_container/train.py
# This demo targets scikit-learn==0.20.4 (installed in the container image below)
import pandas as pd
from sklearn import svm
from sklearn.externals import joblib
from google.cloud import storage
import sklearn

print('sklearn: {}'.format(sklearn.__version__))

# Create a Cloud Storage client to download the data and upload the model
storage_client = storage.Client()

# Download the training data
bucket = storage_client.bucket('demo-scikit-learn-ai-platform-custom-container')
blob = bucket.blob('iris.csv')
blob.download_to_filename('iris.csv')

# Read the training data from the file
iris_data = pd.read_csv('./iris.csv', sep=',',
                        names=["sepal_length", "sepal_width",
                               "petal_length", "petal_width", "species"])

# Separate the target variable (species) from the features
iris_label = iris_data.pop('species')

# We use the SVC (support vector classifier) variant of SVM (support vector machine)
classifier = svm.SVC(gamma='auto')

# Train the model
classifier.fit(iris_data, iris_label)

# Save the trained model locally
model_filename = 'model.joblib'
joblib.dump(classifier, model_filename)

# Upload the model to Cloud Storage
bucket = storage_client.bucket('demo-scikit-learn-ai-platform-custom-container')
blob = bucket.blob(model_filename)
blob.upload_from_filename(model_filename)

The folder structure should look like this now.

ai_platform_training_custom_container
|-train.py
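
Before containerizing, you may want to run the script once locally as a sanity check. This is an optional sketch; it assumes you have authenticated to GCP (for example via gcloud auth application-default login) and installed the pinned versions used in this demo.

!pip install scikit-learn==0.20.4 pandas==0.24.2 google-cloud-storage
!python ./ai_platform_training_custom_container/train.py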

Our training script is now ready. Let’s assume we have a specific requirement of scikit-learn version 0.20.4 and pandas version 0.24.2. Hence we will build a custom container with these specific versions, and that container will be passed to the AI Platform job for model training.

Please create the Dockerfile as below

%%writefile ./ai_platform_training_custom_container/Dockerfile
FROM gcr.io/deeplearning-platform-release/base-cpu
RUN pip install -U scikit-learn==0.20.4 pandas==0.24.2
WORKDIR /app
COPY train.py .
ENTRYPOINT ["python", "train.py"]

In the above Dockerfile, we are creating a custom image.

FROM -> We start from a base image. You can take any base image as per your requirements.

RUN -> Install any specific libraries and frameworks as per requirements.

WORKDIR -> Setting up the working directory.

COPY -> Copying the code file into the working directory.

ENTRYPOINT -> Defining the entry point for the container. In this case, we specify the Python script to be executed.

The folder structure after creating the Dockerfile should look like this now.

ai_platform_training_custom_container
|-train.py
|-Dockerfile
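
If you happen to have Docker installed locally, you can optionally build and test the image before handing it to Cloud Build. This is a minimal sketch and not required for the rest of the tutorial; the local tag name is illustrative, and mounting your gcloud configuration is just one way to give the containerized script access to GCS.

# Build the image locally (run from the folder containing this notebook)
!docker build -t ai_platform_training_custom_container:local ./ai_platform_training_custom_container
# Run the training script inside the container, reusing local gcloud credentials
!docker run --rm -v ~/.config/gcloud:/root/.config/gcloud ai_platform_training_custom_container:local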

We will use Cloud Build to build the image and push it to our project’s Container Registry. Since we use the remote cloud service to build the image, we don’t need a local installation of Docker.

PROJECT_ID='demo_ai_platform'
IMAGE_NAME='ai_platform_training_custom_container'
IMAGE_TAG='latest'
IMAGE_URI='gcr.io/{}/{}:{}'.format(PROJECT_ID, IMAGE_NAME, IMAGE_TAG)
!gcloud builds submit --tag $IMAGE_URI $TRAINING_APP_FOLDER

You will observe that it starts creating the image. It will take a few minutes for the image to be created.

Navigate to Container Registry in the Google Cloud Console and observe that a new image has been created.

New Image Gets Created.

Now we can use this image to start our AI Platform training job via the command below. The difference when using a custom container is that we don’t need to provide the parameters package-path, module-name, runtime-version and python-version. All of these are already available inside the custom container we built above; instead, we use the parameter master-image-uri, which points to the image we created, with the dependencies installed and the code in place.

import time
JOB_NAME = "JOB_{}".format(time.strftime("%Y%m%d_%H%M%S"))
REGION = 'us-central1'
!gcloud ai-platform jobs submit training $JOB_NAME \
--region=$REGION \
--job-dir=gs://demo-scikit-learn-ai-platform-custom-container/ai_platform_training \
--master-image-uri=$IMAGE_URI \
--scale-tier=BASIC

You will observe the message below once the job has been submitted.

Job [JOB_20200721_162347] submitted successfully.
Your job is still active. You may view the status of your job with the command

$ gcloud ai-platform jobs describe JOB_20200721_162347

or continue streaming the logs with the command

$ gcloud ai-platform jobs stream-logs JOB_20200721_162347
jobId: JOB_20200721_162347
state: QUEUED

You can monitor the job by executing the below command

!gcloud ai-platform jobs describe JOB_20200721_162347

It will give you an output as shown below where you can see the job state and other details.

createTime: '2020-07-21T16:23:48Z'
endTime: '2020-07-21T16:30:38Z'
etag: 4Xk9yH-k3a8=
jobId: JOB_20200721_162347
startTime: '2020-07-21T16:28:06Z'
state: SUCCEEDED
trainingInput:
  jobDir: gs://demo-scikit-learn-ai-platform-custom-container/ai_platform_training
  masterConfig:
    imageUri: gcr.io/sourabhjainceanalytics/ai_platform_training_custom_container:latest
  region: us-central1
trainingOutput:
  consumedMLUnits: 0.06
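
Once the job state shows SUCCEEDED, the trained model should be available in the bucket. As an optional sanity check (a minimal sketch; the bucket and file names match what train.py uses above), we can download the artifact and run a quick prediction. Note that loading the model requires scikit-learn 0.20.x locally, matching the version used for training.

!gsutil cp gs://demo-scikit-learn-ai-platform-custom-container/model.joblib .

# Load the trained model and classify a sample flower
from sklearn.externals import joblib
classifier = joblib.load('model.joblib')
print(classifier.predict([[5.1, 3.5, 1.4, 0.2]]))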

Hope this has helped you learn how to use custom containers for model training. You can install any other framework as well. In the next story, we will look into how to use hyperparameter tuning with AI Platform jobs. Happy reading!
