
Vertex AI Tips and Tricks: Setting Up Alerts for Vertex Pipelines with Google Cloud Monitoring


Monitoring and alerting are essential parts of any production system. If your system isn’t working as expected, you want to be the first to know! Machine Learning (ML) systems are no different, and alerting is particularly useful here: with long-running jobs, you don’t want to sit watching the progress!

Vertex AI is integrated with Google Cloud Monitoring, so your Vertex Pipeline status is logged automatically for you, and you can easily create alerting on top of this.

In this blog post, we’ll show you how to set up email alerts that let you know when a Vertex Pipeline has failed in your project. The same approach works with other notification channels too, such as SMS, Slack, and PagerDuty.

Let’s dive in!

 

Getting Started

First, we should create a notification channel to let Google Cloud know where to send our email alerts. In the Google Cloud console, go to the alerting page, and click EDIT NOTIFICATION CHANNELS.

Next to Email, click ADD NEW. Enter the email address to use for the alerts and the display name for the email channel. If you want to send alerts to multiple addresses, you can either set up a Google group with multiple users, or you can set up multiple notification channels (one for each address).

(Beware of the bystander effect! Alerting multiple people can lead to diffusion of responsibility: because everyone knows that others have also been notified, each person assumes that somebody else will deal with the problem, and nobody acts on the alert!)

 

Fig 1: Creating an Email Channel
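
If you prefer to script this step, the same channel can be created from the command line. Here is a minimal sketch using the gcloud CLI; the display name and email address are placeholders to replace with your own:

gcloud beta monitoring channels create \
    --display-name="Vertex Pipeline alerts" \
    --type=email \
    --channel-labels=email_address=ml-alerts@example.com

The command prints the new channel’s resource name (projects/<project_id>/notificationChannels/<channel_id>), which is handy to note down if you later want to manage the channel programmatically.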

Now that we have added a notification channel, let’s head to the Logs Explorer in the Google Cloud console.

On the right-hand side, toggle the switch labelled Show query to reveal the query box. In the query box, paste in the following query:

 

resource.type="aiplatform.googleapis.com/PipelineJob"
jsonPayload.state="PIPELINE_STATE_FAILED"
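
Before creating the alert, you can sanity-check that this filter matches what you expect. For example, using the gcloud CLI (note that gcloud logging read only looks back one day by default; use --freshness to widen the window):

gcloud logging read \
    'resource.type="aiplatform.googleapis.com/PipelineJob" AND jsonPayload.state="PIPELINE_STATE_FAILED"' \
    --limit=5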

 

Click on Create Alert. This will bring up a sidebar on the right-hand side called Create logs-based alert policy.

 

Fig 2: Setting the Alert details

This will bring you to Step 1: Give your policy a name, and add a custom message to the Documentation field. The custom message supports variables and some notification channels support Markdown syntax. You can learn more about how to use these in the documentation. Below is an example you can use to provide a link to your broken pipeline in the console:

 

# Vertex Pipeline failed

The pipeline ${resource.label.pipeline_job_id} has failed. Click [here](https://console.cloud.google.com/vertex-ai/locations/${resource.label.location}/pipelines/runs/${resource.label.pipeline_job_id}?project=${project}) to view in the console

 

Click NEXT to move to Step 2: Choose logs to include in the alert. This should already be filled with the query you used previously.

 

Fig 3: Choosing logs to include in the alert

Click NEXT to move to Step 3: Set notification frequency and autoclose duration. Time between notifications sets the minimum interval between successive notifications, so that you don’t get bombarded by lots of notifications at once! The smallest allowed value is 5 minutes, which is what we would recommend for pipeline failures. Incident autoclose duration sets how long the system waits, once matching log entries stop appearing, before automatically closing the incident.

 

Fig 4: Setting notification frequency and autoclose duration

Click NEXT to move to Step 4: Who should be notified? In the dropdown box here, choose the notification channel that we set up earlier for email notifications. Finally, click SAVE.

 

Fig 5: Choosing who should be notified
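
As an aside, everything we have just clicked through (Steps 1 to 4) can also be done programmatically via the Cloud Monitoring API. Below is a rough sketch using the google-cloud-monitoring Python package; the project and channel IDs are placeholders, and you should double-check the field names against the current client library documentation:

from google.cloud import monitoring_v3

client = monitoring_v3.AlertPolicyServiceClient()

policy = monitoring_v3.AlertPolicy(
    display_name="Vertex Pipeline failures",
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
    # Step 2: the same log filter we pasted into the Logs Explorer
    conditions=[
        monitoring_v3.AlertPolicy.Condition(
            display_name="Failed Vertex Pipeline run",
            condition_matched_log={
                "filter": (
                    'resource.type="aiplatform.googleapis.com/PipelineJob" '
                    'jsonPayload.state="PIPELINE_STATE_FAILED"'
                )
            },
        )
    ],
    # Step 1: the custom documentation sent with each notification
    documentation={
        "content": "# Vertex Pipeline failed\n\nThe pipeline ${resource.label.pipeline_job_id} has failed.",
        "mime_type": "text/markdown",
    },
    # Step 3: notification frequency (5 minutes) and autoclose duration (30 minutes)
    alert_strategy={
        "notification_rate_limit": {"period": {"seconds": 300}},
        "auto_close": {"seconds": 1800},
    },
    # Step 4: the email channel we created earlier
    notification_channels=[
        "projects/<project_id>/notificationChannels/<channel_id>"
    ],
)

client.create_alert_policy(name="projects/<project_id>", alert_policy=policy)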

Now that we have set up our alerting policy, let’s try it out!

On our local machine, let’s install some Python dependencies:

 

pip install kfp==1.8.12 google-cloud-aiplatform==1.14.0

 

Next, let’s create a very basic Vertex Pipeline that will always fail:


from kfp.v2 import dsl, compiler
from google.cloud import aiplatform

# Broken pipeline component
@dsl.component
def broken_task():
    raise Exception("This task is broken!")

# Simple pipeline containing only the broken pipeline component
@dsl.pipeline(name="my-pipeline")
def my_pipeline():
    broken = broken_task()

if __name__ == '__main__':
    # Compile the pipeline to pipeline.json
    compiler.Compiler().compile(
        pipeline_func=my_pipeline,
        package_path="pipeline.json"
    )

    # Launch the pipeline on Vertex AI
    pl = aiplatform.PipelineJob(
        display_name="my pipeline",
        template_path="pipeline.json",
        pipeline_root="gs://<my_bucket>/pipeline_root",
        project="<project_id>",
        location="<region>",
    )

    pl.submit(
        service_account="vertex-pipeline-runner@<project_id>.iam.gserviceaccount.com"
    )

Fig 6: Example code to create a pipeline that fails

As expected, when we run this script, a pipeline is compiled and submitted to Vertex AI, which then fails because of the exception that we have raised.

Fig 7: A broken pipeline
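
You can also confirm the outcome from Python rather than the console. Assuming the pl object from the script above is still in scope, its state property reflects the job’s state on Vertex AI:

# For our deliberately broken pipeline, the state should end up as
# PIPELINE_STATE_FAILED, the same value our log-based alert filter matches.
print(pl.state)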

Oh look, an email notification has come through!

Fig 8: Example alert email for a failed Vertex Pipeline

If we click on VIEW INCIDENT, we are taken back to the console, where we can see that an incident has been created. We have the option to ACKNOWLEDGE INCIDENT or CLOSE INCIDENT. Acknowledge the incident to let others know that you are looking into the issue. Once you have resolved the problem, you can close the incident.

That’s it! Now you can run your Vertex Pipelines safe in the knowledge that if there are any issues, you’ll be the first to know. Don’t forget to follow us on Medium for more Vertex AI Tips and Tricks and much more!

Datatonic are Google Cloud’s Machine Learning Partner of the Year with a wealth of experience developing and deploying impactful Machine Learning models and MLOps Platform builds. Need help with developing an ML model, or deploying your Machine Learning models fast? Have a look at our MLOps 101 webinar, where our experts talk you through how to get started with Machine Learning at scale or get in touch to discuss your ML or MLOps requirements!
