Alerting for Spot VMs Eviction via Webhook

Sourabh Jain
5 min read · Mar 12, 2024

Setup Flow

Spot VMs on Google Cloud Platform offer a cost-effective way to run flexible workloads. These VMs tap into Google’s excess compute capacity, giving you access to powerful machines at significantly reduced prices. In exchange, Google can reclaim (evict) them at any time. Spot VMs are ideal for workloads that are fault-tolerant and can handle interruptions, such as:

  • Batch processing jobs: Tasks that can be broken into smaller units and restarted if necessary.
  • Big data analytics: Large-scale data processing that can be paused or resumed.
  • Development and testing environments: Non-critical workloads where downtime is acceptable.
  • Containerized workloads: Applications that can easily be moved to other instances.

If you have workloads that fit these characteristics, consider leveraging Spot VMs to optimize your cloud computing costs on Google Cloud Platform.

However, a common ask is to be alerted whenever an eviction happens, so that a central team can track how often evictions occur and make informed decisions about workload deployments.

What is Alerting in Google Cloud Platform?

Google Cloud’s Alerting system helps you stay on top of the health and performance of your cloud resources. It works by proactively monitoring your infrastructure and applications, triggering notifications when important conditions or thresholds are met.

Key Concepts

  • Alerting Policies: These define the specific conditions you want to monitor (e.g., high CPU usage, error rates, service unavailability). You configure policies to watch metrics, logs, or uptime checks.
  • Notification Channels: These determine how you receive alerts. Options include email, SMS, webhooks (to integrate with tools like Slack or PagerDuty), and Pub/Sub for more advanced automation.
  • Incidents: When an alerting policy detects a violation, it creates an incident. Incidents track the state of the issue and provide context for troubleshooting.

Why Use Alerting?

  • Rapid Problem Detection: Get notified immediately when issues arise, minimizing downtime.
  • Proactive Monitoring: Stay informed about the health of your systems, allowing you to address potential problems before they become major disruptions.
  • Customization: Tailor your alerting policies to the specific needs of your applications and infrastructure.

In this article, we will set up an alert for Spot VM evictions and link it to a webhook so that any custom processing can be done. There are three major steps:

  1. Cloud Function
  2. Webhook Notification Channel
  3. Setup Alert

Whenever a Spot VM eviction happens, a log entry like the one below is written to Cloud Logging:

{
  "protoPayload": {
    "@type": "type.googleapis.com/google.cloud.audit.AuditLog",
    "status": {
      "message": "Instance was preempted."
    },
    "authenticationInfo": {
      "principalEmail": "system@google.com"
    },
    "serviceName": "compute.googleapis.com",
    "methodName": "compute.instances.preempted",
    "resourceName": "projects/project-id/zones/asia-south1-c/instances/spot-vm",
    "request": {
      "@type": "type.googleapis.com/compute.instances.preempted"
    }
  },
  "insertId": "fjkt68eilhwe",
  "resource": {
    "type": "gce_instance",
    "labels": {
      "zone": "asia-south1-c",
      "project_id": "project-id",
      "instance_id": "3641378894231381941"
    }
  },
  "timestamp": "2024-03-12T06:28:57.549811Z",
  "severity": "INFO",
  "logName": "projects/project-id/logs/cloudaudit.googleapis.com%2Fsystem_event",
  "operation": {
    "id": "systemevent-1710224936389-61370c5d1cc70-2efd03ba-bc902456",
    "producer": "compute.instances.preempted",
    "first": true,
    "last": true
  },
  "receiveTimestamp": "2024-03-12T06:28:58.363951181Z"
}

First, we need to filter on these events so that we can identify when they happen. The filter criteria are:

resource.type="gce_instance" 
protoPayload.request.@type="type.googleapis.com/compute.instances.preempted"
protoPayload.authenticationInfo.principalEmail="system@google.com"
operation.producer="compute.instances.preempted"
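To make it clear what the filter matches, here is a minimal Python sketch that applies the same four conditions to a log entry that has already been loaded as a dictionary. The helper names `get_path` and `is_preemption_entry` are made up for illustration, and the sample is a trimmed copy of the log entry shown earlier. Cloud Logging evaluates the real filter on the server side, so this is only a local approximation.

```python
from typing import Any, Mapping


def get_path(entry: Mapping[str, Any], *keys: str) -> Any:
    """Walk nested keys and return None if any level is missing."""
    current: Any = entry
    for key in keys:
        if not isinstance(current, Mapping) or key not in current:
            return None
        current = current[key]
    return current


def is_preemption_entry(entry: Mapping[str, Any]) -> bool:
    """Apply the same four conditions as the Cloud Logging filter."""
    return (
        get_path(entry, "resource", "type") == "gce_instance"
        and get_path(entry, "protoPayload", "request", "@type")
        == "type.googleapis.com/compute.instances.preempted"
        and get_path(entry, "protoPayload", "authenticationInfo", "principalEmail")
        == "system@google.com"
        and get_path(entry, "operation", "producer")
        == "compute.instances.preempted"
    )


# Trimmed copy of the sample log entry from this article
sample = {
    "protoPayload": {
        "authenticationInfo": {"principalEmail": "system@google.com"},
        "request": {"@type": "type.googleapis.com/compute.instances.preempted"},
    },
    "resource": {"type": "gce_instance"},
    "operation": {"producer": "compute.instances.preempted"},
}

print(is_preemption_entry(sample))
```

An entry that is missing any of the four fields, or carries a different value, is treated as a non-match.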

Now we can create a Cloud Function that accepts the alert notification and processes it. Use the files below, main.py and requirements.txt, for the Cloud Function deployment.

Note: Ensure the Cloud Function's service account has the privileges to access the Compute Engine APIs, and that the function allows unauthenticated invocations so that the webhook can call it.

main.py:

from __future__ import annotations

import functions_framework
from google.cloud import compute_v1


@functions_framework.http
def hello_http(request):
    # Parse the JSON payload sent by the alerting webhook
    request_json = request.get_json(silent=True)

    # Labels of the evicted VM, as reported in the incident
    labels = request_json['incident']['resource']['labels']

    # Fetch instance details from the Compute Engine API
    instance = get_instance(labels['project_id'], labels['zone'], labels['instance_id'])

    # Print the values needed for logging
    print(labels['instance_id'])
    print(labels['project_id'])
    print(labels['zone'])
    print(instance.machine_type)
    print(instance.name)

    return 'Success'


def get_instance(project_id: str, zone: str, instance: str) -> compute_v1.Instance:
    instance_client = compute_v1.InstancesClient()

    # Initialize request argument(s)
    request = compute_v1.GetInstanceRequest(
        instance=instance,
        project=project_id,
        zone=zone,
    )

    # Make the request and return the single matching instance
    return instance_client.get(request=request)

requirements.txt:

functions-framework==3.*
google-cloud-compute
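The function above assumes every request carries `incident.resource.labels`. A request without those fields (for example an empty body, or possibly a test notification) would raise an exception. Here is a minimal sketch of a more defensive lookup. The helper name `extract_instance_labels` is hypothetical, and the payload shape is assumed to be the one the function already reads.

```python
from typing import Any, Optional


def extract_instance_labels(payload: Any) -> Optional[dict]:
    """Return project_id, zone and instance_id, or None if any is missing."""
    if not isinstance(payload, dict):
        return None
    incident = payload.get("incident")
    resource = incident.get("resource") if isinstance(incident, dict) else None
    labels = resource.get("labels") if isinstance(resource, dict) else None
    if not isinstance(labels, dict):
        return None
    wanted = ("project_id", "zone", "instance_id")
    # Reject payloads where any required label is absent or empty
    if any(not labels.get(key) for key in wanted):
        return None
    return {key: str(labels[key]) for key in wanted}


# Payload shape assumed from the fields the Cloud Function reads
payload = {
    "incident": {
        "resource": {
            "type": "gce_instance",
            "labels": {
                "project_id": "project-id",
                "zone": "asia-south1-c",
                "instance_id": "3641378894231381941",
            },
        }
    }
}

print(extract_instance_labels(payload))
print(extract_instance_labels(None))
```

Inside the HTTP handler, a `None` result could be logged and answered with a plain acknowledgement instead of calling the Compute Engine API.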

Once the Cloud Function is deployed, you will get a Cloud Function URL in the following format:

https://[region]-[project-id].cloudfunctions.net/[function-name]

Now let’s start creating the alert and link it to the Cloud Function URL as a webhook via Notification Channel.

Open the Google Cloud Console and navigate to Cloud Logging. Enter the filter criteria below to find the eviction log entries:

resource.type="gce_instance" 
protoPayload.request.@type="type.googleapis.com/compute.instances.preempted"
protoPayload.authenticationInfo.principalEmail="system@google.com"
operation.producer="compute.instances.preempted"

Now click the Create Alert button.

Provide a name for the Alert Policy and set the severity to Warning.

Click Next. The filter criteria are already filled in.

Click Next. You can specify the notification frequency and the duration of the incident.

Click Next. Under Notifications, click Manage Notifications. Go to Webhook and click Add New. Provide the Cloud Function URL and an appropriate display name. Click Test Connection and then click Save.

Once done, go back to the wizard, select the checkbox for the webhook, and click Save.

This completes the integration of Alerting with Cloud Functions and Notification Channels.
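If you prefer to script the notification channel instead of clicking through the console, the sketch below builds the request body for the Cloud Monitoring API's `projects.notificationChannels.create` method. Treat it as an assumption to verify against the API reference: to my understanding, `webhook_tokenauth` is the channel type for plain webhooks and `url` is its label key. The function name `build_webhook_channel` and the display name are made up, and the URL keeps the placeholders used in this article.

```python
import json


def build_webhook_channel(display_name: str, url: str) -> dict:
    """Build a request body for a webhook notification channel."""
    return {
        # Assumed channel type for webhooks; verify in the API reference
        "type": "webhook_tokenauth",
        "displayName": display_name,
        "labels": {"url": url},
    }


body = build_webhook_channel(
    "spot-eviction-webhook",  # hypothetical display name
    "https://[region]-[project-id].cloudfunctions.net/[function-name]",
)
print(json.dumps(body, indent=2))
```

The resulting JSON would be sent in an authenticated POST to the notification channels endpoint of your project. The console flow described above achieves the same result.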

You can test the flow by creating a Spot VM and then simulating a host maintenance event, which preempts the VM.

After creating the VM, run the command below:

gcloud compute instances simulate-maintenance-event [vm-name] --zone [vm-zone]
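You can also exercise the Cloud Function directly, without waiting for an eviction, by posting a hand-built payload to its URL. The sketch below uses only the Python standard library. The names `build_test_payload` and `post_payload` are hypothetical, and the payload contains only the fields the function reads. A real notification carries more fields. The instance has to exist in the given project and zone, otherwise the Compute Engine lookup in the function will fail.

```python
import json
import urllib.request


def build_test_payload(project_id: str, zone: str, instance_id: str) -> dict:
    """Build the minimal payload shape that the Cloud Function reads."""
    return {
        "incident": {
            "resource": {
                "type": "gce_instance",
                "labels": {
                    "project_id": project_id,
                    "zone": zone,
                    "instance_id": instance_id,
                },
            }
        }
    }


def post_payload(url: str, payload: dict) -> int:
    """POST the payload as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


test_payload = build_test_payload("project-id", "asia-south1-c", "3641378894231381941")
print(json.dumps(test_payload, indent=2))

# Uncomment and replace the URL with your deployed function's URL:
# print(post_payload("https://REGION-PROJECT.cloudfunctions.net/FUNCTION", test_payload))
```

After the call, the printed values should appear in the function's logs.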
