Vertex AI Tips and Tricks: Setting Up Alerts for Vertex Pipelines with Google Cloud Monitoring
Monitoring and alerting are essential parts of any production system. If your system isn’t working as expected, you want to be the first to know! Machine Learning (ML) systems are no different, and alerting is particularly useful for them: when a job runs for hours, you don’t want to sit around watching its progress!
Vertex AI is integrated with Google Cloud Monitoring, so your Vertex Pipeline status is logged automatically for you, and you can easily create alerts on top of these logs.
In this blog post, we will walk through an example of setting up email alerts to let you know when a Vertex Pipeline has failed in your project. You can also set up alerts across other communication channels, such as SMS, Slack, and PagerDuty.
Let’s dive in!
Getting Started
First, we should create a notification channel to let Google Cloud know where to send our email alerts. In the Google Cloud console, go to the Alerting page and click EDIT NOTIFICATION CHANNELS.
Next to Email, click ADD NEW. Enter the email address to use for the alerts and the display name for the email channel. If you want to send alerts to multiple addresses, you can either set up a Google group with multiple users, or you can set up multiple notification channels (one for each address).
(Beware of the bystander effect! Alerting multiple people can lead to diffusion of responsibility, whereby nobody acts on an alert because they know that others have also been notified, and each assumes that somebody else will deal with the problem!)
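If you prefer to script this step rather than click through the console, you should be able to create an equivalent email channel with the gcloud CLI; the display name and email address below are just placeholders:
gcloud beta monitoring channels create \
    --display-name="Vertex Pipelines alerts" \
    --type=email \
    --channel-labels=email_address=you@example.com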
Now that we have added a notification channel, let’s head to the Logs Explorer in the Google Cloud console.
On the right-hand side, toggle the switch labelled Show query to reveal the query box. In the query box, paste in the following query:
resource.type="aiplatform.googleapis.com/PipelineJob"
jsonPayload.state="PIPELINE_STATE_FAILED"
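To sanity-check that this filter matches the entries you expect, you can also run the same query from your terminal with the gcloud CLI (the project ID is a placeholder):
gcloud logging read \
    'resource.type="aiplatform.googleapis.com/PipelineJob" AND jsonPayload.state="PIPELINE_STATE_FAILED"' \
    --limit=5 \
    --project=<project_id>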
Click on Create Alert. This will bring up a sidebar on the right-hand side called Create logs-based alert policy.
You will start at Step 1: Give your policy a name. Enter a descriptive name for the policy, and add a custom message to the Documentation field. The custom message supports variables, and some notification channels support Markdown syntax; you can learn more about how to use these in the documentation. Below is an example you can use to provide a link to your broken pipeline in the console:
# Vertex Pipeline failed
The pipeline ${resource.label.pipeline_job_id} has failed. Click [here](https://console.cloud.google.com/vertex-ai/locations/${resource.label.location}/pipelines/runs/${resource.label.pipeline_job_id}?project=${project}) to view in the console
Click NEXT to move to Step 2: Choose logs to include in the alert. This should already be filled with the query you used previously.
Click NEXT to move to Step 3: Set notification frequency and autoclose duration. Specify the options here: Time between notifications describes the minimum time between notifications (so that you don’t get bombarded by lots of notifications at once!). There is a minimum value of 5 minutes, and this is what we would recommend for pipeline failures. Incident autoclose duration describes how long the system should wait to automatically close an incident once matching log entries are absent.
Click NEXT to move to Step 4: Who should be notified? In the dropdown box here, choose the notification channel that we set up earlier for email notifications. Finally, click SAVE.
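If you would rather manage the alerting policy as code, the console steps above can be reproduced with the gcloud CLI by describing the policy in a file. The sketch below is our best approximation of the equivalent log-based alert policy; the project ID and notification channel ID are placeholders, so double-check the field names against the Cloud Monitoring AlertPolicy documentation before relying on it:
displayName: Vertex Pipeline failed
combiner: OR
conditions:
  - displayName: Failed PipelineJob log entry
    conditionMatchedLog:
      filter: >-
        resource.type="aiplatform.googleapis.com/PipelineJob"
        jsonPayload.state="PIPELINE_STATE_FAILED"
alertStrategy:
  notificationRateLimit:
    period: 300s
  autoClose: 1800s
notificationChannels:
  - projects/<project_id>/notificationChannels/<channel_id>
documentation:
  mimeType: text/markdown
  content: The pipeline ${resource.label.pipeline_job_id} has failed.
Save this as policy.yaml (the file name is arbitrary) and create the policy with:
gcloud alpha monitoring policies create --policy-from-file=policy.yaml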
Now we have set up our alerting policy, let’s try it out!
On our local machine, let’s install some Python dependencies:
pip install kfp==1.8.12 google-cloud-aiplatform==1.14.0
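The google-cloud-aiplatform client uses your local Application Default Credentials to authenticate with Vertex AI; if you haven’t set these up yet, you can do so with:
gcloud auth application-default login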
Next, let’s create a very basic Vertex Pipeline that will always fail:
from kfp.v2 import dsl, compiler
from google.cloud import aiplatform


# Broken pipeline component
@dsl.component
def broken_task():
    raise Exception("This task is broken!")


# Simple pipeline containing only the broken pipeline component
@dsl.pipeline(name="my-pipeline")
def my_pipeline():
    broken = broken_task()


if __name__ == '__main__':
    # Compile the pipeline to pipeline.json
    compiler.Compiler().compile(
        pipeline_func=my_pipeline,
        package_path="pipeline.json"
    )

    # Launch the pipeline on Vertex AI
    pl = aiplatform.PipelineJob(
        display_name="my pipeline",
        template_path="pipeline.json",
        pipeline_root="gs://<my_bucket>/pipeline_root",
        project="<project_id>",
        location="<region>",
    )
    pl.submit(
        service_account="vertex-pipeline-runner@<project_id>.iam.gserviceaccount.com"
    )
Fig 6: Example code to create a pipeline that fails
As expected, when we run this script, a pipeline is compiled and submitted to Vertex AI, which then fails because of the exception that we have raised.
Oh look, an email notification has come through!
If we click on VIEW INCIDENT, we are taken back to the console, where we can see that an incident has been created. We have the option to ACKNOWLEDGE INCIDENT or CLOSE INCIDENT. Acknowledge the incident to let others know that you are looking into the issue. Once you have resolved the problem, you can close the incident.
That’s it! Now you can run your Vertex Pipelines safe in the knowledge that if there are any issues, you’ll be the first to know. Don’t forget to follow us on Medium for more Vertex AI Tips and Tricks and much more!
Datatonic are Google Cloud’s Machine Learning Partner of the Year with a wealth of experience developing and deploying impactful Machine Learning models and MLOps Platform builds. Need help with developing an ML model, or deploying your Machine Learning models fast? Have a look at our MLOps 101 webinar, where our experts talk you through how to get started with Machine Learning at scale or get in touch to discuss your ML or MLOps requirements!