MLOps Tools Part 4: Cloud Monitoring vs. Grafana for Monitoring ML Systems
In our previous blogs, we have covered key areas of MLOps and seen tools go head-to-head for Data Transformation (TensorFlow Transform vs. BigQuery), Orchestration (Kubeflow vs. Composer), and Model Serving (Cloud Functions and Cloud Run).
In our fourth blog, we will cover monitoring in Google Cloud, in particular using Cloud Monitoring or Grafana to monitor ML systems in production.
Monitoring ML systems
Imagine this scenario:
After 1 month of research and a great effort from your team, you are ready to deploy your first recommender machine learning model to serve real user requests on your website.
You are highly confident that the model will perform well: your test set shows a promising accuracy in recommending products on your website. You are sure it will make a huge impact compared to the previous recommendation system based on simple heuristics.
At the end of the month, you are really curious to know how much profit the new model generated. The result? The model was a disaster at recommending products to users, even though it never failed to respond to a request.
The day after, you are determined to debug your model and find the cause of the problem. After a couple of hours, you realise there was data leakage in your preprocessing code, which made the problem difficult to notice at modelling time.
Along with monitoring metrics like latency and percentage of failed requests, you now realise that a monitoring dashboard should track in real time how your model behaves once deployed, so that you can spot any ML-related problem as soon as possible.
In a nutshell, this is what monitoring should do for an ML system: track over time, and side by side, metrics related to Software Engineering (such as latency and failed/successful requests) and metrics related to Machine Learning (such as the number of correct predictions, or drift detection).
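To make the idea of tracking both families of metrics concrete, here is a minimal sketch in plain Python. The `RequestLog` record and the `summarise` function are hypothetical names invented for this illustration, and the sketch assumes that ground truth eventually becomes available for each served prediction. It is not tied to Cloud Monitoring or Grafana.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RequestLog:
    """One served request (hypothetical record, for illustration only)."""
    latency_ms: float
    succeeded: bool           # did the service answer without an error?
    prediction_correct: bool  # did the prediction match the later ground truth?


def summarise(window: List[RequestLog]) -> dict:
    """Compute operational and ML metrics over one time window."""
    if not window:
        raise ValueError("window must contain at least one request")

    # Operational metric 1: 95th percentile latency (nearest-rank method).
    latencies = sorted(r.latency_ms for r in window)
    p95_index = max(0, math.ceil(0.95 * len(latencies)) - 1)

    # Operational metric 2: share of requests that failed.
    served = [r for r in window if r.succeeded]
    failed_rate = 1 - len(served) / len(window)

    # ML metric: accuracy, defined only over requests that were served.
    accuracy: Optional[float] = None
    if served:
        accuracy = sum(r.prediction_correct for r in served) / len(served)

    return {
        "p95_latency_ms": latencies[p95_index],
        "failed_request_rate": failed_rate,
        "accuracy": accuracy,
    }


# A window like the one in the scenario above: no failed requests,
# but most recommendations are wrong.
window = [
    RequestLog(120.0, True, False),
    RequestLog(90.0, True, True),
    RequestLog(150.0, True, False),
    RequestLog(110.0, True, False),
]
print(summarise(window))
```

In this window every request succeeds, yet only one prediction in four is correct. A dashboard showing only the operational metrics would look healthy, which is exactly the blind spot described in the scenario.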
In this blog, we are going to focus on two great tools to monitor ML Systems in Google Cloud Platform through dashboards and alerts: Cloud Monitoring and Grafana.
Cloud Monitoring
Cloud Monitoring (see more here) is a service offered by Google Cloud that allows you to collect metrics, events and metadata from Google Cloud and other providers, using them to generate dashboards, charts and alerts.
Something we can notice from the definition above is that Cloud Monitoring performs two independent functions: metrics collection and metrics visualization. This is the first difference between Cloud Monitoring and Grafana, which instead specialises only in metrics visualization and alerting.
In order to create a dashboard to monitor your ML system, Cloud Monitoring allows you to create custom dashboards with the Dashboard Editor, where you can add Line, Stacked area, Stacked Bar, Heatmap, Gauge, Scorecard and Text charts based on the metrics received by Cloud Monitoring. We can, for example, create a dashboard that groups together any relevant metric visualization about our ML System.
The alerting system instead allows you to trigger an alert based on a condition; for example, we can trigger an alert when the F1 score of our model drops below a certain threshold. If that happens, Cloud Monitoring allows you to notify any stakeholder through different notification channels. At the moment, Cloud Monitoring supports notifications through mobile device notifications, PagerDuty, Slack, Webhooks, Email, SMS and Cloud Pub/Sub.
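The kind of condition described above can be sketched in a few lines of plain Python. This is only an illustration of the logic, not the Cloud Monitoring alerting API: the function names, the 0.75 threshold and the rule of three consecutive windows are all assumptions chosen for the example. Requiring the score to stay below the threshold for several windows is one common way to avoid alerting on a single noisy measurement.

```python
from typing import Sequence


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from true positives, false positives and false negatives."""
    denominator = 2 * tp + fp + fn
    # With no positives at all, F1 is undefined; report 0.0 by convention.
    return 2 * tp / denominator if denominator else 0.0


def should_alert(f1_history: Sequence[float],
                 threshold: float,
                 consecutive: int = 3) -> bool:
    """Return True when the last `consecutive` scores are all below `threshold`."""
    if consecutive < 1:
        raise ValueError("consecutive must be at least 1")
    if len(f1_history) < consecutive:
        return False  # not enough data yet to decide
    recent = f1_history[-consecutive:]
    return all(score < threshold for score in recent)


# Hypothetical F1 values computed once per evaluation window.
history = [0.82, 0.74, 0.71, 0.69]
if should_alert(history, threshold=0.75):
    print("F1 has stayed below 0.75 for 3 windows: notify stakeholders")
```

In a real setup, the F1 value would be exported as a custom metric and the condition and notification channels would be configured in the monitoring tool itself.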
Reasons to choose Cloud Monitoring
- Fully managed service in Google Cloud: Cloud Monitoring is available to every Google Cloud user as a fully managed service, so you don’t need to manage any infrastructure. In addition, the pricing model is particularly convenient, as you pay only for the custom metrics you ingest and for the read calls you make to the service. Using Cloud Monitoring is likely to be free of charge, given that there is a free tier for both metrics ingested and metrics read. Deploying Grafana in GCP instead requires creating a VM or Kubernetes cluster to run the service. This means incurring costs for running the VM and being responsible for any downtime or problem in the infrastructure where Grafana runs. Grafana also offers a fully managed version through Grafana Cloud, at a price that scales with the number of users or the volume of metrics.
- Tight integration with Google Cloud products: Cloud Monitoring becomes really useful when using managed Google Cloud services, as every service comes with a set of metrics that are automatically sent to Cloud Monitoring. In addition, many services already include default dashboards built on top of the exported metrics. When creating a service in Cloud Run, for example, you get by default a set of dashboards for operational metrics like latency or number of failed requests. It’s possible to build dashboards for Google Cloud products in Grafana using Cloud Monitoring as a source, but this requires time to configure the source and create the dashboard.
- Easy to configure and to use: The fact that it is managed and already integrated with every Google Cloud service makes it easy for any team to start using Cloud Monitoring. Installing Grafana in Google Cloud requires a substantial amount of dev time, as it needs to be installed in Kubernetes Engine or Compute Engine and associated with a Continuous Delivery pipeline.
Grafana
Grafana (see more here) is a multi-platform open source monitoring solution. Grafana is able to query several metrics databases and display these metrics through dashboards. Similarly to Cloud Monitoring, it is also possible to define alerts based on a condition over a certain threshold and notify stakeholders through various channels like Slack or Email.
Grafana is currently the world’s most popular solution for creating monitoring dashboards and can be installed either in your own infrastructure for free or as a fully managed hosted solution with Grafana Cloud in case you don’t want to have to deal with the underlying infrastructure and operationalization.
Reasons to choose Grafana
- Dashboard customisations: when compared to Cloud Monitoring, Grafana allows the users to customize their dashboards with a wide variety of charts, many of them provided by the community. Imagine doing video object detection and being able to have a live video stream to see what our ML model is predicting in real time.
- Plugins & multiple data sources: the great customisation capabilities in Grafana are made possible by plugins that can be added to the Grafana environment. There are 3 types of plugin available:
  - Panel, allowing Grafana to display new data visualizations.
  - Data Source, allowing Grafana to connect to different databases and data sources; this includes a plugin to connect Cloud Monitoring as a data source.
  - App, allowing the installation of a standalone app inside Grafana, including both data sources and panels.
  At the moment, the visualization possibilities in Cloud Monitoring are limited to graphs created from metrics. In addition, it is not possible to use external plugins.
- Kubernetes based: we’ve already discussed that a reason to choose Cloud Monitoring over Grafana is that the former is serverless and managed. Depending on the architecture and the requirements, running Grafana yourself can be either an advantage or a disadvantage. For example, we may need Grafana when we have to analyze a large number of custom metrics, because in Grafana there is no difference between custom and non-custom metrics. In addition, Grafana can potentially be installed anywhere Kubernetes runs, which allows for multi-cloud or hybrid monitoring dashboards and makes the service portable. In Cloud Monitoring it is possible to monitor multi-cloud environments, but this is not recommended, as you may incur high costs when ingesting these metrics.
Summary
Monitoring is a central part of any MLOps environment, and the choice between the monitoring services depends on the requirements of the MLOps platform.
Choose Cloud Monitoring if you require a managed and serverless monitoring service, you are already using managed Google Cloud services, and you only require monitoring through metrics.
Choose Grafana if you require specific visualizations through plugins, if you need to integrate different data sources, or if you want to use your monitoring system for multi-cloud or hybrid architectures.
Our next and final blog post of the MLOps Tools series delves into Feature Store, comparing BigQuery + Memorystore with FEAST. Check it out here.
For more info on how to get your MLOps journey right, make sure you download our 2021 Guide to MLOps here.