MLOps Tools Part 4: Cloud Monitoring vs. Grafana for Monitoring ML Systems

In our previous blogs, we covered key areas of MLOps and saw tools go head-to-head for Data Transformation (TensorFlow Transform vs. BigQuery), Orchestration (Kubeflow vs. Composer), and Model Serving (Cloud Functions vs. Cloud Run).

In our fourth blog, we will cover monitoring in Google Cloud, in particular using Cloud Monitoring or Grafana to monitor ML systems in production.

Monitoring ML systems

Imagine this scenario:

After one month of research and a great effort from your team, you are ready to deploy your first recommender machine learning model to serve real user requests on your website.

You are highly confident that the model will perform well: your test set shows a promising accuracy in recommending products on your website. You are sure it will make a huge impact compared to the previous recommendation system based on simple heuristics.

At the end of the month, you are really curious to know how much profit the new model generated. The result? The model was a disaster at recommending products to users, even though it never failed to respond to a request.

The day after, you are determined to debug your model and find the cause of the problem. After a couple of hours, you realise there was a leakage in your preprocessing code, which made the problem difficult to notice at modelling time.

Along with monitoring metrics like latency and percentage of failed requests, you now realise that a monitoring dashboard should also track in real time how your model behaves once deployed, so that you can spot any ML-related problem as soon as possible.
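To make the scenario concrete, here is a minimal sketch of a tracker that keeps operational and ML metrics side by side over a sliding window of requests. The class and field names (ServingMonitor, RequestRecord, correct) are hypothetical, and it assumes you eventually learn whether each recommendation was correct, for example from a click.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, Optional


@dataclass
class RequestRecord:
    latency_ms: float
    failed: bool             # operational outcome: did the request error out?
    correct: Optional[bool]  # ML outcome: None until feedback arrives


class ServingMonitor:
    """Keeps the last `window` requests and summarises both kinds of metric."""

    def __init__(self, window: int = 1000) -> None:
        self.records: Deque[RequestRecord] = deque(maxlen=window)

    def observe(self, latency_ms: float, failed: bool,
                correct: Optional[bool] = None) -> None:
        self.records.append(RequestRecord(latency_ms, failed, correct))

    def snapshot(self) -> Dict[str, float]:
        n = len(self.records)
        if n == 0:
            return {}
        # Accuracy is computed only over requests that already have feedback.
        labelled = [r for r in self.records if r.correct is not None]
        accuracy = (sum(r.correct for r in labelled) / len(labelled)
                    if labelled else float("nan"))
        return {
            "mean_latency_ms": sum(r.latency_ms for r in self.records) / n,
            "failure_rate": sum(r.failed for r in self.records) / n,
            "accuracy": accuracy,
        }


# Every request succeeds and is fast, yet most recommendations are wrong.
monitor = ServingMonitor(window=4)
monitor.observe(20.0, failed=False, correct=True)
monitor.observe(30.0, failed=False, correct=False)
monitor.observe(40.0, failed=False, correct=False)
monitor.observe(30.0, failed=False, correct=False)
snapshot = monitor.snapshot()
print(snapshot)
```

In this toy run the failure rate is zero while accuracy is only 25%, which is exactly the kind of silent problem that operational metrics alone would not reveal.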

In a nutshell, this is what monitoring should do for an ML system: track, at the same time, metrics related to software engineering (such as latency and failed/successful requests) and metrics related to machine learning (such as the number of correct predictions, or drift detection).
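Drift detection can be implemented in many ways. As one illustration, the sketch below computes the Population Stability Index (PSI) between a feature's training values and its live values. PSI is a technique we are swapping in for the example, not something either tool prescribes, and the cut points and the 0.2 alert level are assumptions (0.2 is a commonly quoted rule of thumb, not a hard rule).

```python
import bisect
import math
from typing import List, Sequence


def bin_fractions(values: Sequence[float], cuts: Sequence[float]) -> List[float]:
    """Share of values in each bin defined by the sorted cut points.

    len(cuts) cut points produce len(cuts) + 1 bins covering the whole line.
    """
    counts = [0] * (len(cuts) + 1)
    for v in values:
        counts[bisect.bisect_right(cuts, v)] += 1
    return [c / len(values) for c in counts]


def psi(expected: Sequence[float], actual: Sequence[float],
        cuts: Sequence[float], eps: float = 1e-6) -> float:
    """Population Stability Index between two samples of one feature."""
    total = 0.0
    for e, a in zip(bin_fractions(expected, cuts), bin_fractions(actual, cuts)):
        # Clamp empty bins so the logarithm stays finite.
        e, a = max(e, eps), max(a, eps)
        total += (a - e) * math.log(a / e)
    return total


cuts = [0.25, 0.5, 0.75]                 # hypothetical bin boundaries
training = [0.1, 0.3, 0.6, 0.9]          # values seen at training time
live = [0.8, 0.9, 0.85, 0.95]            # values seen in production

drift = psi(training, live, cuts)
print(f"PSI = {drift:.2f}, drifted = {drift > 0.2}")
```

A value like this can be exported as a custom metric and charted next to latency and failure rate, so that both kinds of metric appear on the same dashboard.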

In this blog, we are going to focus on two great tools to monitor ML Systems in Google Cloud Platform through dashboards and alerts: Cloud Monitoring and Grafana.

Cloud Monitoring

Cloud Monitoring (see more here) is a service offered by Google Cloud that allows you to collect metrics, events and metadata from Google Cloud and other providers, using them to generate dashboards, charts and alerts.

Something we can notice from the definition above is that Cloud Monitoring performs two independent functions: metrics collection and metrics visualization. This is the first difference between Cloud Monitoring and Grafana, which instead specialises only in metrics visualization and alerting.
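On the collection side, a custom ML metric reaches Cloud Monitoring as a time series point. The sketch below only builds a request body in the shape used by the Cloud Monitoring API v3 timeSeries.create method, without sending it. The project ID, metric name and label are hypothetical, and the field names reflect our reading of the API, so check them against the current reference before relying on them.

```python
import json
from datetime import datetime, timezone
from typing import Dict, Optional


def build_time_series(project_id: str, metric_name: str, value: float,
                      labels: Optional[Dict[str, str]] = None,
                      end_time: Optional[datetime] = None) -> dict:
    """Build (but do not send) a body for one custom metric data point."""
    end_time = end_time or datetime.now(timezone.utc)
    timestamp = end_time.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "timeSeries": [{
            # Custom metrics live under the custom.googleapis.com/ prefix.
            "metric": {
                "type": f"custom.googleapis.com/{metric_name}",
                "labels": labels or {},
            },
            # "global" is a generic monitored resource type.
            "resource": {"type": "global", "labels": {"project_id": project_id}},
            "points": [{
                "interval": {"endTime": timestamp},
                "value": {"doubleValue": float(value)},
            }],
        }]
    }


# Hypothetical project, metric name and label, for illustration only.
body = build_time_series(
    project_id="my-project",
    metric_name="ml/recommender/f1_score",
    value=0.82,
    labels={"model_version": "v1"},
    end_time=datetime(2021, 6, 1, 12, 0, tzinfo=timezone.utc),
)
print(json.dumps(body, indent=2))
```

In practice you would send this through an authenticated client rather than build the JSON by hand; the point here is only to show what a collected ML metric looks like.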

In order to create a dashboard to monitor your ML system, Cloud Monitoring allows you to create custom dashboards with the Dashboard Editor, where you can add Line, Stacked area, Stacked Bar, Heatmap, Gauge, Scorecard and Text charts based on the metrics received by Cloud Monitoring. We can, for example, create a dashboard that groups together any relevant metric visualization about our ML System.
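Dashboards can also be described as JSON instead of being assembled by hand in the Dashboard Editor. The sketch below builds a two-chart dashboard definition in the shape we understand the Cloud Monitoring Dashboards API to accept (gridLayout, xyChart, timeSeriesFilter). The metric types are hypothetical custom metrics, and the field names should be treated as assumptions to verify against the API reference.

```python
import json
from typing import List


def line_chart(title: str, metric_type: str,
               alignment_period: str = "60s") -> dict:
    """One line chart widget plotting the mean of a metric per period."""
    return {
        "title": title,
        "xyChart": {
            "dataSets": [{
                "plotType": "LINE",
                "timeSeriesQuery": {
                    "timeSeriesFilter": {
                        "filter": f'metric.type="{metric_type}"',
                        "aggregation": {
                            "alignmentPeriod": alignment_period,
                            "perSeriesAligner": "ALIGN_MEAN",
                        },
                    }
                },
            }]
        },
    }


def build_dashboard(display_name: str, widgets: List[dict],
                    columns: int = 2) -> dict:
    """Wrap widgets in a simple grid layout."""
    return {
        "displayName": display_name,
        "gridLayout": {"columns": columns, "widgets": widgets},
    }


# Hypothetical custom metrics: one operational, one ML-related.
dashboard = build_dashboard(
    "Recommender model monitoring",
    [
        line_chart("Serving latency (ms)",
                   "custom.googleapis.com/ml/recommender/latency_ms"),
        line_chart("F1 score",
                   "custom.googleapis.com/ml/recommender/f1_score"),
    ],
)
print(json.dumps(dashboard, indent=2))
```

Keeping the definition in code makes it possible to review and version the dashboard alongside the model it monitors.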

The alerting system instead allows you to trigger an alert based on a condition; for example, we can trigger an alert when the F1 score of our model drops below a certain threshold. If that happens, Cloud Monitoring can notify any stakeholder through different notification channels. At the moment, Cloud Monitoring supports mobile device notifications, PagerDuty, Slack, Webhooks, Email, SMS and Cloud Pub/Sub.
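The logic behind such an alert condition can be sketched in a few lines. The example below computes the F1 score from confusion counts for successive evaluation windows and fires only when the score stays below the threshold for several windows in a row. The threshold, the number of windows and the counts are made up for illustration, and this shows the idea rather than how Cloud Monitoring evaluates conditions internally.

```python
from typing import List, Sequence, Tuple


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def should_alert(scores: Sequence[float], threshold: float,
                 consecutive: int = 3) -> bool:
    """True when the last `consecutive` scores are all below `threshold`."""
    if len(scores) < consecutive:
        return False
    return all(s < threshold for s in scores[-consecutive:])


# Hypothetical (TP, FP, FN) counts for four successive evaluation windows.
windows: List[Tuple[int, int, int]] = [
    (80, 10, 10),
    (60, 20, 20),
    (50, 25, 25),
    (40, 30, 30),
]
scores = [f1_score(*w) for w in windows]
alert = should_alert(scores, threshold=0.8, consecutive=3)
print([round(s, 3) for s in scores], alert)
```

Requiring several consecutive breaches is a simple way to avoid notifying stakeholders about a single noisy window.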

Example of ML Dashboard created with Cloud Monitoring

Reasons to choose Cloud Monitoring

Fully managed service in Google Cloud: Cloud Monitoring is available to every Google Cloud user as a fully managed service, so you don’t need to manage any infrastructure. The pricing model is also convenient: you pay only for the custom metrics you ingest and for the read calls you make to the service. Because both ingestion and reads have a free tier, using Cloud Monitoring is likely to be free of charge. Deploying Grafana in GCP instead requires a VM or Kubernetes cluster to run the service, which means paying for that infrastructure and being responsible for any downtime or problem affecting it. Grafana also offers a fully managed version through Grafana Cloud, at a price that scales with the number of users or the volume of metrics.

Tight integration with Google Cloud products: Cloud Monitoring becomes really useful when using managed Google Cloud services, because every service implements a set of metrics that are automatically sent to Cloud Monitoring. In addition, many services already include default dashboards built on top of the exported metrics. When creating a service in Cloud Run, for example, you get by default a set of dashboards that monitor operational metrics like latency or number of failed requests. It is possible to build dashboards for Google Cloud products in Grafana using Cloud Monitoring as a data source, but configuring and creating them takes time.

Easy to configure and to use: The fact that it is managed and already integrated with every Google Cloud service makes it easy for any team to start using Cloud Monitoring. Installing Grafana in Google Cloud requires a substantial amount of dev time, as it needs to be installed in Kubernetes Engine or Compute Engine and associated with a Continuous Delivery pipeline.

Grafana

Grafana (see more here) is a multi-platform open source monitoring solution. Grafana is able to query several metrics databases and display these metrics through dashboards. Similarly to Cloud Monitoring, it is also possible to define alerts based on a condition over a certain threshold and notify stakeholders through various channels like Slack or Email.

Grafana is currently the world’s most popular solution for creating monitoring dashboards. It can be installed in your own infrastructure for free, or used as a fully managed hosted solution with Grafana Cloud if you don’t want to deal with the underlying infrastructure and operationalization.

Example of ML dashboard in Grafana implemented for Seldon Core

Reasons to choose Grafana

Dashboard customisations: compared to Cloud Monitoring, Grafana lets users customise their dashboards with a wide variety of charts, many of them provided by the community. Imagine doing video object detection and having a live video stream that shows what the ML model is predicting in real time.

Plugins & multiple data sources: the great customisation capabilities in Grafana are made possible by plugins that can be added to the Grafana environment. There are 3 types of plugin available:

Panel, allowing Grafana to display new data visualizations.
Data Source, allowing Grafana to connect to different databases and data sources. This includes a plugin to connect Cloud Monitoring as a data source.
App, allowing the installation of a standalone app inside Grafana, including both data sources and panels.

At the moment, visualization in Cloud Monitoring is limited to graphs created from metrics. In addition, it is not possible to use external plugins.

Kubernetes based: we’ve already discussed that one reason to choose Cloud Monitoring over Grafana is that it is serverless and managed. Depending on the architecture and the requirements, running Grafana yourself can be either an advantage or a disadvantage. For example, we may need Grafana when we have to analyse a large number of custom metrics, because Grafana makes no distinction between custom and built-in metrics. In addition, Grafana can be installed potentially anywhere Kubernetes runs, which allows multi-cloud or hybrid monitoring dashboards and makes the service portable. Cloud Monitoring can also monitor multi-cloud environments, but this is not recommended because ingesting those metrics may incur high costs.

Grafana dashboard for object detection, from "Build an AI-driven object detection algorithm with balenaOS and alwaysAI"

Summary

Monitoring is a central part of any MLOps environment, and the choice between the monitoring services depends on the requirements of the MLOps platform.

Choose Cloud Monitoring if you require a managed and serverless monitoring service, you are already using managed Google Cloud services, and you only require monitoring through metrics.

Choose Grafana if you require specific visualizations through plugins, if you need to integrate different data sources, or if you want to use your monitoring system for multi-cloud or hybrid architectures.

Our next and final blog post of the MLOps Tools series delves into Feature Store, comparing BigQuery + Memorystore with FEAST. Check it out here.

For more info on how to get your MLOps journey right, make sure you download our 2021 Guide to MLOps here.
