Author: Steve Mudute-Ndumbe, Machine Learning Engineer
Parts one and two of our MLOps tools series gave you an insight into data transformation tools (TensorFlow Transform vs. BigQuery) and orchestration tools (Kubeflow Pipelines vs. Cloud Composer). In this post, we will look at model serving on Google Cloud Platform (GCP), with a deep dive into two common options: Cloud Functions and Cloud Run. We will discuss reasons for using each tool, as well as highlight our experiences with them in practice.
Before we dive into these two tools, we first need to understand the different ways that you can obtain predictions from a machine learning model. Once a model has been created and saved, there are two types of prediction you can serve with it:
The first type is known as offline (or batch) predictions, and is common when there is no strict latency requirement on predictions (for example, providing propensity scores to help align marketing campaigns). Here, the key requirement is that the model can handle large blocks of request data arriving all at once.
The second type is known as online predictions, and is much more common when models are being used in a real-time scenario (for example, providing real-time content recommendations on a website). The challenges here are more extensive, and include handling fluctuations in request volume; supplying responses as quickly as possible; and ensuring minimal or no downtime when deploying updated models.
In this blog post, we will discuss Cloud Functions and Cloud Run in the context of serving online predictions on Google Cloud Platform.
Cloud Functions is Google Cloud's FaaS (Functions as a Service) offering, allowing you to deploy ad-hoc functions on Google Cloud Platform without worrying about the underlying infrastructure. There are several ways of triggering these functions; for model serving, users typically wrap the model in a simple Python function, which accepts data from an HTTP request and returns model predictions.
In a typical serving architecture, an incoming HTTP request triggers the Python function to execute. The function takes the JSON payload from the request and passes it to the model, which returns predictions; these predictions are then returned in the response to the HTTP request.
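As a minimal sketch of what such a function can look like (the bucket name, object path, and request schema below are hypothetical, and we assume a pickled scikit-learn model stored in Cloud Storage):

```python
import pickle

from flask import jsonify
from google.cloud import storage

MODEL_BUCKET = "my-model-bucket"   # hypothetical bucket name
MODEL_BLOB = "models/model.pkl"    # hypothetical object path

_model = None  # cached across invocations while the instance stays warm


def _load_model():
    global _model
    if _model is None:
        blob = storage.Client().bucket(MODEL_BUCKET).blob(MODEL_BLOB)
        _model = pickle.loads(blob.download_as_bytes())
    return _model


def predict(request):
    """HTTP-triggered entry point; expects JSON like {"instances": [[...], ...]}."""
    instances = request.get_json(silent=True)["instances"]
    predictions = _load_model().predict(instances)
    return jsonify(predictions=predictions.tolist())
```

Deploying the function is then a single command, e.g. `gcloud functions deploy predict --runtime python39 --trigger-http`.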
Cloud Functions was the serving tool of choice in our recent engagement with News UK. Its ease of use made it simple to integrate with the client’s existing APIs, whilst still performing under strict latency requirements. It was also very simple to deploy our chosen model architecture with Cloud Functions, and the generated endpoint performed well under stress testing simulating the expected number of requests in production.
Cloud Run is a managed version of the open-source project Knative, running on Google Kubernetes Engine. It allows you to easily serve models that have been packaged in a container, without needing to worry about the underlying compute infrastructure. If you already have a Kubernetes cluster, you can also choose to serve a model in that cluster (using Cloud Run for Anthos); otherwise, Cloud Run will deploy the service on a Kubernetes cluster managed by Google Cloud.
Now, when an HTTP request is made, it is routed to a container instance within the Kubernetes cluster; the container image, which packages the model, is pulled from Google Container Registry and deployed via Cloud Run. The model accepts the data sent to the container and returns predictions, which are returned in the response to the HTTP request.
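As a sketch of what such a container might run (the model path, endpoint, and request schema are our own assumptions, not a prescribed Cloud Run API), a minimal Flask server could look like this:

```python
import os
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical model file baked into the container image at build time.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)


@app.route("/predict", methods=["POST"])
def predict():
    instances = request.get_json()["instances"]
    return jsonify(predictions=model.predict(instances).tolist())


if __name__ == "__main__":
    # Cloud Run tells the container which port to listen on via $PORT.
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))
```

Once this is containerised and the image pushed to Google Container Registry, the service can be deployed with, for example, `gcloud run deploy my-model --image gcr.io/PROJECT_ID/my-model` (service and image names hypothetical).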
Cloud Run was the clear recommendation for a recent fraud detection project we undertook with a world leader in payments processing. Cloud Run's ability to manage traffic between containers made it well suited to implementing a Champion/Challenger model infrastructure on a Kubernetes cluster that the client could maintain. Moreover, Cloud Run for Anthos offers the ability to utilise GPUs for serving on GCP, allowing us to use model frameworks and libraries specifically optimised for GPU resource consumption. Finally, the client wanted more control over the scaling of the serving infrastructure, in order to handle the extreme volume of transactions they process.
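To illustrate the traffic management point: Cloud Run lets you route a chosen percentage of requests to named revisions of a service, so a challenger model can receive a small slice of live traffic alongside the champion. With hypothetical revision names, a command such as `gcloud run services update-traffic my-model --to-revisions=my-model-champion=90,my-model-challenger=10` would send 90% of requests to the champion and 10% to the challenger.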
Overall, Cloud Functions is a useful tool for quickly deploying a simple model, whilst making use of a typical data scientist's existing skillset. If you have more engineering expertise at hand, need to make use of custom binaries and libraries, or need a more complex serving architecture that integrates with your own infrastructure, then we heavily recommend Cloud Run as the more suitable option.
Cost-wise, Cloud Functions and Cloud Run are comparable when handling a small number of requests synchronously. However, as your request load increases and you need to handle requests concurrently, Cloud Run quickly becomes the cheaper and faster option: a Cloud Functions instance processes one request at a time, whereas a single Cloud Run container instance can serve many requests concurrently (80 by default, configurable via the `--concurrency` flag), so fewer instances are needed for the same load.
We hope you enjoyed our MLOps tool comparison blog for Model Serving. In the next part, we will look at different tools for monitoring model performance on Google Cloud Platform.
Want to know more about MLOps? Download our 2021 Guide to MLOps here.