Model Training Using Google Cloud AI Platform: Hyperparameter Tuning

Sourabh Jain
7 min read · Aug 9, 2020
Hyperparameter Tuning

In the previous articles, we saw how to use Google Cloud AI Platform to train a model.

In this article, we will look into how to use Google Cloud AI Platform to perform hyperparameter tuning.

Before we see an example of how to perform Hyperparameter tuning, let’s understand the fundamentals of it.

Hyperparameters are the configuration values that govern the training process itself.

Your training application handles three categories of data as it trains your model:

  • Your input data (also called training data) is a collection of individual records (instances) containing the features important to your machine learning problem. This data is used during training to configure your model to accurately make predictions about new instances of similar data. However, the values in your input data never directly become part of your model.
  • Your model’s parameters are the variables that your chosen machine learning technique uses to adjust to your data. For example, a deep neural network (DNN) is composed of processing nodes (neurons), each with an operation performed on data as it travels through the network. When your DNN is trained, each node has a weight value that tells your model how much impact it has on the final prediction. Those weights are an example of your model’s parameters. In many ways, your model’s parameters are the model — they are what distinguishes your particular model from other models of the same type working on similar data.
  • Your hyperparameters are the variables that govern the training process itself. For example, part of setting up a deep neural network is deciding how many hidden layers of nodes to use between the input layer and the output layer, and how many nodes each layer should use. These variables are not directly related to the training data. They are configuration variables. Note that parameters change during a training job, while hyperparameters are usually constant during a job.

You can read more about hyperparameter tuning here.
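To make the distinction concrete, here is a minimal scikit-learn sketch of my own (using the same SGDClassifier, with the 0.20.x-style loss='log' that the training script below also uses): alpha and max_iter are hyperparameters we choose before training, while the fitted coefficients are parameters the model learns from the data.

from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Hyperparameters: chosen by us (or by the tuning service) before training starts
clf = SGDClassifier(loss='log', alpha=0.0005, max_iter=200, tol=1e-3)

# Parameters: learned from the data during training
X, y = make_classification(n_samples=500, n_features=12, random_state=42)
clf.fit(X, y)
print(clf.coef_.shape)  # the learned weights are the model's parameters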

The example in this story is structured as below:

  • Prepare the training, validation data and set the environment variables.
  • Write the model training code.
  • Build a custom container for model training.
  • Create a hyperparameters configuration file.
  • Submit Training Job using hyperparameters configuration.
  • Retrieve the most optimised trial of the hyperparameters model training job.
  • Submit Training Job using the above hyperparameters value.

Step 1 : Prepare the training, validation data and set the environment variables.

We will create a bucket and download the training and validation data. We will also set the environment variables.

Let’s set the environment variables. Please change the values appropriately.

REGION = '<<REGION_NAME>>'
ARTIFACT_STORE = 'gs://<<BUCKET_NAME>>'
PROJECT_ID = '<<PROJECT_ID>>'
DATA_ROOT='{}/data'.format(ARTIFACT_STORE)
JOB_DIR_ROOT='{}/jobs'.format(ARTIFACT_STORE)
TRAINING_FILE_PATH='{}/{}/{}'.format(DATA_ROOT, 'training', 'training_dataset_custom_container.csv')
VALIDATION_FILE_PATH='{}/{}/{}'.format(DATA_ROOT, 'validation', 'validation_dataset_custom_container.csv')

Download the training and validation datasets and copy them to the respective Google Cloud Storage locations.

!wget https://gist.githubusercontent.com/jainsourabh2/6d929697c95484fcc13edec93243b5c0/raw/99472a73f7509c7a3e31b354ce791cf1a40f0d6f/training_dataset_custom_container.csv
!wget https://gist.githubusercontent.com/jainsourabh2/07ad8bca7caf0d02a0ca893490d59be6/raw/6707fb18fe12b7c701033c2a497030d977264d63/validation_dataset_custom_container.csv
gsutil mb gs://<<BUCKET_NAME>>
gsutil cp training_dataset_custom_container.csv gs://<<BUCKET_NAME>>/data/training/
gsutil cp validation_dataset_custom_container.csv gs://<<BUCKET_NAME>>/data/validation/
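To confirm the files landed where the training job expects them, you can optionally list the bucket contents:

gsutil ls -r gs://<<BUCKET_NAME>>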

Step 2 : Write the model training code.

We will now create a folder and write the complete code in a file. This file will then be used to create a container in the next step.

import os

TRAINING_APP_FOLDER = 'training_app_story'
os.makedirs(TRAINING_APP_FOLDER, exist_ok=True)

The above step creates a folder, and the next step writes the training script into that folder.

%%writefile {TRAINING_APP_FOLDER}/train.py

import os
import subprocess
import sys
import fire
import pickle
import numpy as np
import pandas as pd
import hypertune

from sklearn.compose import ColumnTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder


def train_evaluate(job_dir, training_dataset_path, validation_dataset_path, alpha, max_iter, hptune):

    df_train = pd.read_csv(training_dataset_path)
    df_validation = pd.read_csv(validation_dataset_path)

    # When not tuning, train on the combined training + validation data
    if not hptune:
        df_train = pd.concat([df_train, df_validation])

    numeric_feature_indexes = slice(0, 10)
    categorical_feature_indexes = slice(10, 12)

    # Standardize numeric features and one-hot encode categorical features
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', StandardScaler(), numeric_feature_indexes),
            ('cat', OneHotEncoder(), categorical_feature_indexes)
        ])

    pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('classifier', SGDClassifier(loss='log', tol=1e-3))
    ])

    num_features_type_map = {feature: 'float64' for feature in df_train.columns[numeric_feature_indexes]}
    df_train = df_train.astype(num_features_type_map)
    df_validation = df_validation.astype(num_features_type_map)

    print('Starting training: alpha={}, max_iter={}'.format(alpha, max_iter))
    X_train = df_train.drop('Cover_Type', axis=1)
    y_train = df_train['Cover_Type']

    pipeline.set_params(classifier__alpha=alpha, classifier__max_iter=max_iter)
    pipeline.fit(X_train, y_train)

    if hptune:
        X_validation = df_validation.drop('Cover_Type', axis=1)
        y_validation = df_validation['Cover_Type']
        accuracy = pipeline.score(X_validation, y_validation)
        print('Model accuracy: {}'.format(accuracy))
        # Log it with hypertune so the AI Platform tuning service can read the metric
        hpt = hypertune.HyperTune()
        hpt.report_hyperparameter_tuning_metric(
            hyperparameter_metric_tag='accuracy',
            metric_value=accuracy
        )

    # Save the model only when we are not tuning
    if not hptune:
        model_filename = 'model.pkl'
        with open(model_filename, 'wb') as model_file:
            pickle.dump(pipeline, model_file)
        gcs_model_path = "{}/{}".format(job_dir, model_filename)
        subprocess.check_call(['gsutil', 'cp', model_filename, gcs_model_path], stderr=sys.stdout)
        print("Saved model in: {}".format(gcs_model_path))


if __name__ == "__main__":
    fire.Fire(train_evaluate)

Important points of the above code are as below:

  • We have defined a function that accepts the below parameters:
job_dir -> GCS Path for storing the job packages & model.
training_dataset_path -> GCS path holding training dataset.
validation_dataset_path -> GCS path holding validation dataset.
alpha -> hyperparameter
max_iter -> hyperparameter
hptune -> variable to decide if hyperparameter tuning is to be done or not.
  • We use the hypertune package to report the accuracy optimization metric to the AI Platform hyperparameter tuning service.
  • The training pipeline preprocesses data by standardizing all numeric features using sklearn.preprocessing.StandardScaler and encoding all categorical features using sklearn.preprocessing.OneHotEncoder.
  • If hyperparameter tuning is to be performed, we do not generate the model. If it is not to be performed, we generate the model and copy it to the GCS path.
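Before containerizing, you can optionally sanity-check the function from a notebook cell. This is only a sketch; it assumes you run it from inside the training_app_story folder, that the two CSV files from Step 1 are in the same folder, and that fire, cloudml-hypertune, scikit-learn and pandas are installed locally.

from train import train_evaluate

train_evaluate(
    job_dir='.',  # unused when hptune=True
    training_dataset_path='training_dataset_custom_container.csv',
    validation_dataset_path='validation_dataset_custom_container.csv',
    alpha=0.0005,   # sample hyperparameter values for the smoke test
    max_iter=200,
    hptune=True     # reports accuracy only, does not save a model
)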

Step 3 : Build a custom container for model training.

Now we will build a custom container for running the training job. The below Dockerfile installs the specific libraries and versions we need for model training and specifies the entry point of the Python script. This step will create a Dockerfile in the “training_app_story” folder.

%%writefile {TRAINING_APP_FOLDER}/Dockerfile

FROM gcr.io/deeplearning-platform-release/base-cpu
RUN pip install -U fire cloudml-hypertune scikit-learn==0.20.4 pandas==0.24.2
WORKDIR /app
COPY train.py .
ENTRYPOINT ["python", "train.py"]

Now we will build the container using the below commands.

IMAGE_NAME='trainer_image_story'
IMAGE_TAG='latest'
IMAGE_URI='gcr.io/{}/{}:{}'.format(PROJECT_ID, IMAGE_NAME, IMAGE_TAG)
!gcloud builds submit --tag $IMAGE_URI $TRAINING_APP_FOLDER

The above step will take a few minutes, but once it is completed, you will be able to see an image in GCP's Container Registry.
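Optionally, you can confirm that the image was pushed by listing the repository from the notebook:

!gcloud container images list --repository=gcr.io/$PROJECT_ID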

Step 4 : Create a hyperparameters configuration file.

We will now create a hyperparameter configuration file that needs to be passed as input to the model training job. This step will create an hptuning_config.yaml file in the “training_app_story” folder. We can set the desired configuration here. It’s important to remember that we provide ranges (or sets of candidate values) for the parameters rather than single fixed values. AI Platform learns from the output of completed trials and picks the next set of values to try. You can read more about the specifications here.

%%writefile {TRAINING_APP_FOLDER}/hptuning_config.yaml

trainingInput:
  hyperparameters:
    goal: MAXIMIZE
    maxTrials: 6
    maxParallelTrials: 3
    hyperparameterMetricTag: accuracy
    enableTrialEarlyStopping: TRUE
    params:
    - parameterName: max_iter
      type: DISCRETE
      discreteValues: [
          200,
          500
      ]
    - parameterName: alpha
      type: DOUBLE
      minValue: 0.00001
      maxValue: 0.001
      scaleType: UNIT_LINEAR_SCALE

Step 5 : Submit Training Job using hyperparameters configuration.

Now that the required setup is in place, we are ready to submit the job. Please execute the below commands. The parameters are self-explanatory; I am passing the container image that we created above along with the hyperparameter tuning configuration file.

JOB_NAME = "JOB_{}".format(time.strftime("%Y%m%d_%H%M%S"))
JOB_DIR = "{}/{}".format(JOB_DIR_ROOT, JOB_NAME)
SCALE_TIER = "BASIC"
!gcloud ai-platform jobs submit training $JOB_NAME \
--region=$REGION \
--job-dir=$JOB_DIR \
--master-image-uri=$IMAGE_URI \
--scale-tier=$SCALE_TIER \
--config $TRAINING_APP_FOLDER/hptuning_config.yaml \
-- \
--training_dataset_path=$TRAINING_FILE_PATH \
--validation_dataset_path=$VALIDATION_FILE_PATH \
--hptune

The above step will submit the job and produce the below output, which indicates that the job has been successfully submitted to AI Platform Jobs. You can go to AI Platform Jobs in the console and validate it.

Job [JOB_20200809_181711] submitted successfully.
Your job is still active. You may view the status of your job with the command

$ gcloud ai-platform jobs describe JOB_20200809_181711

or continue streaming the logs with the command

$ gcloud ai-platform jobs stream-logs JOB_20200809_181711
jobId: JOB_20200809_181711
state: QUEUED

We can monitor the job using the below command.

!gcloud ai-platform jobs describe $JOB_NAME

The output will be similar to the below. This will take a few minutes, as 6 trials will be run in total with 3 in parallel. Once the job is completed, the state will change from RUNNING to SUCCEEDED. Proceed with the next steps only when this step has completed successfully.

createTime: '2020-08-09T18:17:13Z'
etag: bI3ZLmws8Ek=
jobId: JOB_20200809_181711
startTime: '2020-08-09T18:17:15Z'
state: RUNNING
trainingInput:
  args:
  - --training_dataset_path=gs://bucket/data/training/training_dataset_custom_container.csv
  - --validation_dataset_path=gs://bucket/data/validation/validation_dataset_custom_container.csv
  - --hptune
  hyperparameters:
    enableTrialEarlyStopping: true
    goal: MAXIMIZE
    hyperparameterMetricTag: accuracy
    maxParallelTrials: 3
    maxTrials: 6
    params:
    - discreteValues:
      - 200.0
      - 500.0
      parameterName: max_iter
      type: DISCRETE
    - maxValue: 0.001
      minValue: 1e-05
      parameterName: alpha
      scaleType: UNIT_LINEAR_SCALE
      type: DOUBLE
  jobDir: gs://bucket/jobs/JOB_20200809_181711
  masterConfig:
    imageUri: gcr.io/project_id/trainer_image_story:latest
  region: region
trainingOutput:
  hyperparameterMetricTag: accuracy
  isHyperparameterTuningJob: true

View job in the Cloud Console at:
https://console.cloud.google.com/mlengine/jobs/JOB_20200809_181711?project=project_id

View logs at:
https://console.cloud.google.com/logs?resource=ml.googleapis.com%2Fjob_id%2FJOB_20200809_181711&project=project_id
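As the describe output suggests, you can also stream the trial logs directly from the notebook instead of opening the console:

!gcloud ai-platform jobs stream-logs $JOB_NAME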

Step 6 : Retrieve the most optimised trial of the hyperparameters model training job.

Now we will retrieve the alpha and max_iter values for which our goal was optimized, i.e. maximum accuracy.

from googleapiclient import discovery
from googleapiclient import errors

ml = discovery.build('ml', 'v1')
job_id = 'projects/{}/jobs/{}'.format(PROJECT_ID, JOB_NAME)
request = ml.projects().jobs().get(name=job_id)

try:
    response = request.execute()
except errors.HttpError as err:
    print(err)
except:
    print("Unexpected error")

response

The above script retrieves the job output. The returned trials are sorted by the value of the optimization metric, and the best run is the first item in the list. Let's fetch it by running the below commands.

response['trainingOutput']['trials'][0]

alpha = response['trainingOutput']['trials'][0]['hyperparameters']['alpha']
max_iter = response['trainingOutput']['trials'][0]['hyperparameters']['max_iter']
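Optionally, you can print the winning trial before resubmitting. This is a small sketch that assumes the trial entries follow the structure returned by the jobs API (a trialId, a finalMetric with the objective value, and the hyperparameter values, which come back as strings):

best_trial = response['trainingOutput']['trials'][0]
print('Best trial: {}'.format(best_trial['trialId']))
print('Best accuracy: {}'.format(best_trial['finalMetric']['objectiveValue']))
print('alpha={}, max_iter={}'.format(alpha, max_iter))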

Step 7 : Submit Training Job using the above hyperparameters value.

Now we will run the model training using these specific parameter values. The output model will be available in the GCS path passed as part of submitting the job.

JOB_NAME = "JOB_{}".format(time.strftime("%Y%m%d_%H%M%S"))
JOB_DIR = "{}/{}".format(JOB_DIR_ROOT, JOB_NAME)
SCALE_TIER = "BASIC"
!gcloud ai-platform jobs submit training $JOB_NAME \
--region=$REGION \
--job-dir=$JOB_DIR \
--master-image-uri=$IMAGE_URI \
--scale-tier=$SCALE_TIER \
-- \
--training_dataset_path=$TRAINING_FILE_PATH \
--validation_dataset_path=$VALIDATION_FILE_PATH \
--alpha=$alpha \
--max_iter=$max_iter \
--nohptune
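Once this job reaches the SUCCEEDED state, the serialized model.pkl should be visible under the job directory. A quick optional check:

!gcloud ai-platform jobs describe $JOB_NAME
!gsutil ls $JOB_DIR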

I hope this gave you a brief introduction to using hyperparameter tuning on Google Cloud AI Platform.

The code example is adapted from the notebook at the below URL:

https://github.com/GoogleCloudPlatform/mlops-on-gcp/blob/master/workshops/kfp-caip-sklearn/lab-01-caip-containers/lab-01.ipynb
