Author: Elia Secchi, Machine Learning Engineer
In our previous blogs, we have covered key areas of MLOps and seen tools go head-to-head for Data Transformation (Tensorflow Transform vs. BigQuery), Orchestration (Kubeflow vs. Composer), and Model Serving (Cloud Functions and Cloud Run).
In our fourth blog, we will cover monitoring in GCP, in particular using Cloud Monitoring or Grafana to monitor ML systems in production.
Imagine this scenario:
After one month of research and a great effort from your team, you are ready to deploy your first recommender machine learning model to serve real user requests on your website.
You are highly confident that the model will perform well: your test set shows a promising accuracy in recommending products on your website. You are sure it will make a huge impact compared to the previous recommendation system based on simple heuristics.
At the end of the month, you are really curious to know how much profit the new model generated. The result? The model was a disaster at recommending products to users, even though it never failed to respond to a request.
The day after, you are determined to debug your model and find the cause of the problem. After a couple of hours, you realise there was leakage in your preprocessing code, which made the problem difficult to notice at modelling time.
Along with monitoring metrics like latency and percentage of failed requests, you realise now that a monitoring dashboard should be able to track in real time how your model is behaving when deployed so that you will be able to spot any ML related problem as soon as possible.
In a nutshell, this is what monitoring should do for an ML system: track over time, and side by side, metrics related to Software Engineering (such as latency and failed/successful requests) and metrics related to Machine Learning (such as the number of correct predictions, or drift detection).
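To make the idea concrete, here is a minimal sketch of a tracker that keeps both kinds of metrics over a sliding window of recent requests. The `ServingMonitor` class and its field names are hypothetical and only illustrate the principle; it does not use Cloud Monitoring or Grafana, and it assumes ground-truth feedback (for example, whether the user clicked the recommendation) may arrive for only some requests.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, Optional


@dataclass
class RequestRecord:
    latency_ms: float
    succeeded: bool
    # None while the ground truth for this prediction is still unknown.
    correct: Optional[bool] = None


class ServingMonitor:
    """Hypothetical helper: tracks service and ML metrics over recent requests."""

    def __init__(self, window_size: int = 1000) -> None:
        # Only the most recent `window_size` requests are kept.
        self.records: Deque[RequestRecord] = deque(maxlen=window_size)

    def record(self, latency_ms: float, succeeded: bool,
               correct: Optional[bool] = None) -> None:
        self.records.append(RequestRecord(latency_ms, succeeded, correct))

    def snapshot(self) -> Dict[str, Optional[float]]:
        total = len(self.records)
        if total == 0:
            return {"mean_latency_ms": None, "failure_rate": None, "accuracy": None}
        # Accuracy is computed only over requests with known ground truth.
        labelled = [r for r in self.records if r.correct is not None]
        accuracy = (
            sum(1 for r in labelled if r.correct) / len(labelled)
            if labelled else None
        )
        return {
            # Software Engineering metrics
            "mean_latency_ms": sum(r.latency_ms for r in self.records) / total,
            "failure_rate": sum(1 for r in self.records if not r.succeeded) / total,
            # Machine Learning metric
            "accuracy": accuracy,
        }
```

In the scenario above, the latency and failure rate would have looked healthy all month, while the accuracy value would have exposed the problem within the first days.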
In this blog, we are going to focus on two great tools to monitor ML Systems in Google Cloud Platform through dashboards and alerts: Cloud Monitoring and Grafana.
Cloud Monitoring (see more here) is a service offered by Google Cloud that allows you to collect metrics, events and metadata from Google Cloud and other providers, using them to generate dashboards, charts and alerts.
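As an illustration of the collection side, the sketch below builds a single data point for a custom metric in the JSON shape used by the Cloud Monitoring API's time series resource. It only constructs the payload; sending it (for example through the API's `timeSeries.create` method or a client library) is not shown. The helper name `build_time_series`, the metric name `ml/f1_score`, the `model_version` label and the choice of the `global` monitored resource are assumptions made for this example.

```python
from datetime import datetime, timezone
from typing import Dict, Optional


def build_time_series(
    project_id: str,
    metric_name: str,
    value: float,
    labels: Optional[Dict[str, str]] = None,
    end_time: Optional[datetime] = None,
) -> dict:
    """Build one custom-metric data point as a plain dictionary.

    Hypothetical helper: `end_time` is assumed to be timezone-aware.
    """
    end_time = end_time or datetime.now(timezone.utc)
    return {
        # User-defined metrics live under the custom.googleapis.com/ prefix.
        "metric": {
            "type": f"custom.googleapis.com/{metric_name}",
            "labels": dict(labels or {}),
        },
        # Assumption: the generic "global" resource is good enough here.
        "resource": {"type": "global", "labels": {"project_id": project_id}},
        "points": [
            {
                "interval": {"endTime": end_time.isoformat()},
                "value": {"doubleValue": float(value)},
            }
        ],
    }


point = build_time_series(
    "my-project",  # placeholder project ID
    "ml/f1_score",
    0.87,
    labels={"model_version": "v1"},
    end_time=datetime(2021, 6, 1, 12, 0, tzinfo=timezone.utc),
)
```

Labels such as the model version make it possible to compare two deployed models on the same chart.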
Something we can notice from the definition above is that Cloud Monitoring performs two independent functions: metrics collection and metrics visualization. This is the first difference between Cloud Monitoring and Grafana, which instead specialises only in metrics visualization and alerting.
In order to create a dashboard to monitor your ML system, Cloud Monitoring allows you to create custom dashboards with the Dashboard Editor, where you can add Line, Stacked area, Stacked Bar, Heatmap, Gauge, Scorecard and Text charts based on the metrics received by Cloud Monitoring. We can, for example, create a dashboard that groups together any relevant metric visualization about our ML System.
The alerting system instead allows you to trigger an alert based on a condition; for example, we can trigger an alert when the F1 score of our model drops below a certain threshold. If that happens, Cloud Monitoring allows you to notify any stakeholder through different notification channels. At the moment, Cloud Monitoring supports notifications through mobile device notifications, PagerDuty, Slack, Webhooks, Email, SMS and Cloud Pub/Sub.
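The logic behind such an alert condition can be sketched in a few lines. This is only an illustration of the check itself, not of how Cloud Monitoring evaluates alerting policies; the function names and the 0.8 threshold are arbitrary choices for the example.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when there is nothing to score."""
    denominator = 2 * true_positives + false_positives + false_negatives
    if denominator == 0:
        return 0.0
    return 2 * true_positives / denominator


def should_alert(f1: float, threshold: float = 0.8) -> bool:
    """Fire only when the score falls strictly below the threshold."""
    return f1 < threshold


# Counts collected over the latest evaluation window (made-up numbers).
current_f1 = f1_score(true_positives=5, false_positives=5, false_negatives=5)
if should_alert(current_f1):
    print(f"ALERT: F1 score {current_f1:.2f} is below the threshold")
```

In practice, the score would be published as a metric and the threshold comparison would be configured in the alerting tool, so that notifications go through the channels listed above.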
Grafana (see more here) is a multi-platform open source monitoring solution. Grafana is able to query several metrics databases and display these metrics through dashboards. Similarly to Cloud Monitoring, it is also possible to define alerts based on a condition over a certain threshold and notify stakeholders through various channels like Slack or Email.
Grafana is currently the world’s most popular solution for creating monitoring dashboards. It can be installed in your own infrastructure for free, or used as a fully managed hosted solution with Grafana Cloud if you don’t want to deal with the underlying infrastructure and its operation.
At the moment, visualization in Cloud Monitoring is limited to charts created from metrics. In addition, it is not possible to use external plugins.
Monitoring is a central part of any MLOps environment, and the choice between the monitoring services depends on the requirements of the MLOps platform.
Choose Cloud Monitoring if you require a managed and serverless monitoring service, you are already using managed GCP services, and you only require monitoring through metrics.
Choose Grafana if you require specific visualizations through the use of plugins, if you need to integrate different data sources, or if you want to use your monitoring system for multi-cloud or hybrid architectures.
Our next and final blog post of the MLOps Tools series delves into Feature Stores, comparing BigQuery + Memorystore with FEAST. Check it out here.
For more info on how to get your MLOps journey right, make sure you download our 2021 Guide to MLOps here.